Leveraging Cache Coherence in Active Memory Systems
|
|
- Noel Morrison
- 5 years ago
- Views:
Transcription
1 Leveraging Cache Coherence in Active Memory Systems Daehyun Kim, Mainak Chaudhuri, and Mark Heinrich Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University
2 Outline Introduction Active Memory Techniques Data Coherence Memory Controller Architecture Simulation Results & Analysis Conclusions
3 Motivation Memory Wall Gap between Processor Speed and Memory Speed Solutions: Caching, Prefetching, Data-Intensive Applications Example: Scientific Calculations, Multimedia Operations Characteristics: Huge Data Size Poor Locality New approach: Active Memory Systems
4 Active Memory Systems Active Memory Controller Improving Cache Behavior by Address Re-mapping Memory Controller Processor Cache Address Re-mapping Memory Active Memory Element Data Parallel Processing in Memory Memory DRAM DRAM DRAM DRAM Logic Logic Logic Logic
5 Data Coherence Data Coherence Active Memory Controller: Accessing same data via different addresses Active Memory Element: Multiple processors in the memory system Conventional Approach Cache Flush: Flush overhead Not transparent Hard to support multiprocessor systems Our Approach Leverage and Extend Cache Coherence Protocol: Solves the above problems Supports more active memory techniques
6 Related Work Active Memory Controller Approach Impulse (University of Utah) Memory Forwarding (Carnegie Mellon University) Active Memory Element Approach Active Pages (University of California at Davis) DIVA (University of Southern California) FlexRAM (University of Illinois at Urbana Champaign) Main Contribution of Our Study Transparent Address Re-mapping Techniques Multiprocessor Active Memory Systems New Classes of Active Memory Operations
7 Active Memory Techniques Matrix Transpose Sparse Matrix Linked List Linearization Parallel Reduction
8 Matrix Transpose Example Code for i=0 to N-1 for j=0 to N-1 x += A[i][j]; A for i=0 to N-1 for j=0 to N-1 x += A[j][i]; Problem Column-wise Accesses of Matrix A
9 Matrix Transpose Active Memory Optimization for i=0 to N-1 for j=0 to N-1 x += A[i][j]; A A A = Transpose(A); for i=0 to N-1 for j=0 to N-1 x += A [i][j]; Data Coherence A[i][j] A [j][i]
10 Linked List Linearization Example Code ptr = A; while ptr!= NULL sum += ptr data; ptr = ptr next; A Problem Traversal of Linked List A
11 Linked List Linearization Active Memory Optimization A = Linearize(A); ptr = A ; while ptr!= NULL sum += ptr data; ptr = ptr next; A A Data Coherence i-th Element of A i-th Element of A
12 Data Coherence Problem Processor Cache C1 Dirty C2 Shared D D Memory C0 D C1 D C2 D C3 D
13 Solution to Data Coherence Problem Solution Mutual Exclusion Only one data element among the mapped ones can be cached at a time Implementation Extension of Cache Coherence Protocol Memory controller invalidates any other cache lines that are mapped to the requested line
14 Example: Mutual Exclusion Processor Cache C0 Shared C1 Dirty C2 Shared D D D Memory C0 D C1 D C2 D C3 D
15 Active Memory Controller Architecture PI NI Dispatch Unit Instruction Cache AMPU AMDU Memory Interface Memory Cells Data Cache Active Memory Processor Unit Active Memory Data Unit Data Buffer Send Unit PI NI
16 Example: Matrix Transpose for i=0 to N-1 for j=0 to N-1 x += A[i][j]; A A for A = i=0 Transpose(A); to N-1 for i=0 to N-1 for j=0 to N-1 for j=0 to N-1 x += += A[i][j]; A [i][j]; for i=0 to N-1 for j=0 to N-1 x += A[i][j];
17 Example: Matrix Transpose Cache Memory Controller Memory C0 C1 C2 A A C7
18 Example: Matrix Transpose Cache C1 C2 Dirty Shared Memory Controller Memory C0 C1 C2 A A C7
19 Example: Matrix Transpose Cache C1 C2 Dirty Shared Memory Controller Memory C0 C1 C2 A A C C7
20 Example: Matrix Transpose Cache C1 Dirty C Dirty C2 Shared Memory Controller Memory C0 C1 C2 A A C C7
21 Example: Matrix Transpose Cache C Dirty Memory Controller Memory C0 C1 C2 A A C C7
22 Example: Matrix Transpose Cache C Dirty Memory Controller Memory C0 C1 C2 A A C C7
23 Example: Matrix Transpose Cache C0 C Dirty Memory Controller Memory C0 C1 C2 A A C C7
24 Simulation Environment Simulator Main Processor: MIPS based ISA, 2 GHz Instruction Cache, Data Cache: 2 way set associative, 32 KB Unified Second-level Cache: 2 way set associative, 512 KB TLB: Fully associative, 64 Entries Memory Latency: 125 ns Invalidation-based Bitvector Protocol User-level Cache Flush Applications Matrix Transpose: SPLASH-2 FFT, FFTW, Transpose Sparse Matrix: Conjugate Gradient, SMVP Linked List Linearization: Health, MST, Traverse Parallel Reduction: MMM, Sparse Flow, Reduction
25 Matrix Transpose Normal AM AM+Prefetch Flush Speedup FFT FFTW Transpose
26 Sparse Matrix Normal AM AM+Prefetch Flush Speedup Conjugate Gradient SMVM
27 Linked List Linearization 8 7 Normal AM AM+Prefetch 6 Speedup Health MST Traverse
28 SMP - FFT Normal AM 3 Speedup Number of Processors
29 SMP Parallel Reduction Normal AM 3.5 Speedup MMM Sparse Flow Reduction
30 Conclusions Future Work Multi-node Active Memory Systems Active Memory Elements Summary Data Coherence Problem in Active Memory Systems Cache Coherence Active Memory Architecture Transparent Address Re-mapping Techniques Multiprocessor Active Memory Systems New Classes of Active Memory Operations Simulations Results: 1.3 to 7.7 Speedup
ACTIVE memory systems provide a promising approach
IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004 1 Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems Daehyun Kim, Mainak Chaudhuri, Student Member, IEEE, Mark
More informationSimplifying Active Memory Clusters by Leveraging Directory Protocol Threads
Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads Dhiraj D. Kalamkar Systems Research Center Intel Technology India Pvt. Ltd. Bangalore 56 INDIA dhiraj.d.kalamkar@intel.com Mainak
More informationCS/CoE 1541 Exam 2 (Spring 2019).
CS/CoE 1541 Exam 2 (Spring 2019) Name: Question 1 (5+5+5=15 points): Show the content of each of the caches shown below after the two memory references 35, 44 Use the notation [tag, M(address),] to describe
More informationThe Impact of Negative Acknowledgments in Shared Memory Scientific Applications
134 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 2, FEBRUARY 2004 The Impact of Negative Acknowledgments in Shared Memory Scientific Applications Mainak Chaudhuri, Student Member,
More informationCS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page.
CS 433 Homework 4 Assigned on 10/17/2017 Due in class on 11/7/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationVIRTUAL channels in routers allow messages to share a
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 8, AUGUST 2004 1 Exploring Virtual Network Selection Algorithms in DSM Cache Coherence Protocols Mainak Chaudhuri and Mark Heinrich,
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationMemory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez
Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection
More informationHigh performance computing. Memory
High performance computing Memory Performance of the computations For many programs, performance of the calculations can be considered as the retrievability from memory and processing by processor In fact
More informationMemory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Memory Hierarchy Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationDISTRIBUTED shared memory (DSM) multiprocessors are
IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 7, JULY 2003 1 Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation Mainak Chaudhuri, Student Member, IEEE, Mark Heinrich, Member,
More informationCSE 141 Spring 2016 Homework 5 PID: Name: 1. Consider the following matrix transpose code int i, j,k; double *A, *B, *C; A = (double
CSE 141 Spring 2016 Homework 5 PID: Name: 1. Consider the following matrix transpose code int i, j,k; double *A, *B, *C; A = (double *)malloc(sizeof(double)*n*n); B = (double *)malloc(sizeof(double)*n*n);
More informationReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard
More informationINTEGRATED memory controllers appearing in several highend
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007 1 Integrated Memory Controllers with Parallel Coherence Streams Mainak Chaudhuri, Member, IEEE, and Mark Heinrich, Senior
More information15-740/ Computer Architecture
15-740/18-740 Computer Architecture Lecture 19: Caching II Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/31/2011 Announcements Milestone II Due November 4, Friday Please talk with us if you
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationThe Impulse Memory Controller
The Impulse Memory Controller Lixin Zhang, Zhen Fang, Mike Parker, Binu K. Mathew, Lambert Schaelicke, John B. Carter, Wilson C. Hsieh, Sally A. McKee September 24, 2001 Abstract Impulse is a memory system
More informationModule 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks
Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware
More information5 Solutions. Solution a. no solution provided. b. no solution provided
5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationAS the processor-memory speed gap continues to widen,
IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 7, JULY 2004 843 Design and Optimization of Large Size and Low Overhead Off-Chip Caches Zhao Zhang, Member, IEEE, Zhichun Zhu, Member, IEEE, and Xiaodong Zhang,
More informationReal-Time Cache Management for Multi-Core Virtualization
Real-Time Cache Management for Multi-Core Virtualization Hyoseung Kim 1,2 Raj Rajkumar 2 1 University of Riverside, California 2 Carnegie Mellon University Benefits of Multi-Core Processors Consolidation
More informationData Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun
Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation
More informationProf. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. See P&H Chapter: , 5.8, 5.10, 5.15; Also, 5.13 & 5.
Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University See P&H Chapter: 5.1-5.4, 5.8, 5.10, 5.15; Also, 5.13 & 5.17 Writing to caches: policies, performance Cache tradeoffs and
More informationMake sure that your exam is not missing any sheets, then write your full name and login ID on the front.
ETH login ID: (Please print in capital letters) Full name: 63-300: How to Write Fast Numerical Code ETH Computer Science, Spring 015 Midterm Exam Wednesday, April 15, 015 Instructions Make sure that your
More informationTiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu Carnegie Mellon University HPCA - 2013 Executive
More informationIntegrated Memory Controllers with Parallel Coherence Streams
Integrated Memory Controllers with Parallel Coherence Streams Mainak Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Kanpur 20806 INDIA mainakc@cse.iitk.ac.in Abstract
More informationand data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed
5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid
More informationLatency Hiding on COMA Multiprocessors
Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access
More informationChapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationMemory Hierarchies && The New Bottleneck == Cache Conscious Data Access. Martin Grund
Memory Hierarchies && The New Bottleneck == Cache Conscious Data Access Martin Grund Agenda Key Question: What is the memory hierarchy and how to exploit it? What to take home How is computer memory organized.
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More information1993 Paper 3 Question 6
993 Paper 3 Question 6 Describe the functionality you would expect to find in the file system directory service of a multi-user operating system. [0 marks] Describe two ways in which multiple names for
More informationLow-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM
Low-Cost Inter-Linked ubarrays (LIA) Enabling Fast Inter-ubarray Data Movement in DRAM Kevin Chang rashant Nair, Donghyuk Lee, augata Ghose, Moinuddin Qureshi, and Onur Mutlu roblem: Inefficient Bulk Data
More information1/19/2009. Data Locality. Exploiting Locality: Caches
Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed
More informationCS152 Computer Architecture and Engineering
CS152 Computer Architecture and Engineering Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/fa16 The problem sets are intended to help
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationPARALLEL MEMORY ARCHITECTURE
PARALLEL MEMORY ARCHITECTURE Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 6 is due tonight n The last
More informationL2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary
HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationLecture 7: Implementing Cache Coherence. Topics: implementation details
Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationA Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads
A Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads Mingliang Wei, Marc Snir, Josep Torrellas (UIUC) Brett Tremaine (IBM) Work supported by HPCS/PERCS Motivation Many important
More informationHandout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationPageForge: A Near-Memory Content- Aware Page-Merging Architecture
PageForge: A Near-Memory Content- Aware Page-Merging Architecture Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas University of Illinois at Urbana-Champaign MICRO-50 @ Boston Motivation: Server
More informationMemory. From Chapter 3 of High Performance Computing. c R. Leduc
Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More information18-447: Computer Architecture Lecture 18: Virtual Memory III. Yoongu Kim Carnegie Mellon University Spring 2013, 3/1
18-447: Computer Architecture Lecture 18: Virtual Memory III Yoongu Kim Carnegie Mellon University Spring 2013, 3/1 Upcoming Schedule Today: Lab 3 Due Today: Lecture/Recitation Monday (3/4): Lecture Q&A
More informationEITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor
EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration I/O MultiProcessor Summary 2 Virtual memory benifits Using physical memory efficiently
More informationComputer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic
More informationLecture 5: Directory Protocols. Topics: directory-based cache coherence implementations
Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationSMD149 - Operating Systems - Multiprocessing
SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction
More informationOverview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy
Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required
More informationCOSC3330 Computer Architecture Lecture 20. Virtual Memory
COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use
More informationCS533 Concepts of Operating Systems. Jonathan Walpole
CS533 Concepts of Operating Systems Jonathan Walpole Disco : Running Commodity Operating Systems on Scalable Multiprocessors Outline Goal Problems and solutions Virtual Machine Monitors(VMM) Disco architecture
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationChapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture
An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationVirtual Memory. Motivation:
Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space
More informationSpring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand
Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationECE 571 Advanced Microprocessor-Based Design Lecture 10
ECE 571 Advanced Microprocessor-Based Design Lecture 10 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 22 February 2018 Announcements HW#5 will be posted, caches Midterm: Thursday
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationPowerPC 740 and 750
368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order
More informationChapter 10: Virtual Memory. Lesson 05: Translation Lookaside Buffers
Chapter 10: Virtual Memory Lesson 05: Translation Lookaside Buffers Objective Learn that a page table entry access increases the latency for a memory reference Understand that how use of translationlookaside-buffers
More informationCache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri
Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationComputer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James
Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationVirtual Memory. Main memory is a CACHE for disk Advantages: illusion of having more physical memory program relocation protection.
Virtual Memory Main memory is a CACHE for disk Advantages: illusion of having more physical memory program relocation protection L18 Virtual Memory 1 Pages: Virtual Memory Blocks Page faults: the data
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More informationLecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationECE 411 Exam 1 Practice Problems
ECE 411 Exam 1 Practice Problems Topics Single-Cycle vs Multi-Cycle ISA Tradeoffs Performance Memory Hierarchy Caches (including interactions with VM) 1.) Suppose a single cycle design uses a clock period
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 41: "Case Studies: Intel Montecito and Sun Niagara" TLP on Chip: HT/SMT and CMP.
Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 41: "Case Studies: Intel Montecito and Sun Niagara" TLP on Chip: HT/SMT and CMP Intel Montecito Features Overview Dual threads Thread urgency Core arbiter
More informationOptimizing Sparse Matrix Vector Multiplication on Emerging Multicores
Optimizing Sparse Matrix Vector Multiplication on Emerging Multicores Orhan Kislal, Wei Ding, Mahmut Kandemir The Pennsylvania State University University Park, Pennsylvania, USA omk03, wzd09, kandemir@cse.psu.edu
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationLecture 24: Memory, VM, Multiproc
Lecture 24: Memory, VM, Multiproc Today s topics: Security wrap-up Off-chip Memory Virtual memory Multiprocessors, cache coherence 1 Spectre: Variant 1 x is controlled by attacker Thanks to bpred, x can
More informationgem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood
gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory
More informationPaging. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Paging Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Paging Allows the physical address space of a process to be noncontiguous Divide virtual
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More information