
Memory Systems and Compiler Support for MPSoC Architectures
Mahmut Kandemir and Nikil Dutt, Chapter 9
Fernando Moraes, May 28, 2013

MPSoC Advantages
An MPSoC architecture has several advantages over a conventional strategy that employs a single, more powerful (but complex) processor on the chip:
- the design of an on-chip multiprocessor composed of multiple simple processor cores is simpler than that of a complex single-processor system
- better utilization of the silicon space
- an MPSoC architecture can exploit loop-level parallelism at the software level in array-intensive embedded applications
- energy savings through careful and selective management of individual processors

MPSoC Critical Component
MPSoC: a platform for executing the array-intensive computations commonly found in embedded image and video processing applications.
Its most critical component is the memory system:
- applications spend a significant portion of their cycles in the memory hierarchy
- the memory system can contribute up to 90% of the overall system power
- a significant portion of the transistors in an MPSoC-based architecture is expected to be devoted to the memory hierarchy

Ways of Optimizing Memory Performance
1. constructing a suitable memory organization/hierarchy: caches, scratch pad memories, stream buffers, FIFOs, etc.
2. optimizing the software (application) for it
- traditional scheme: performance (execution cycles)
- MPSoC: also energy/power consumption and memory space usage

MEMORY ARCHITECTURES
The application-specific nature of embedded systems presents new opportunities for aggressive customization and exploration of architectural issues → the features of the given application can be used to determine the architectural parameters. Example: floating-point arithmetic → DISCUSS!
Traditionally, memory issues have been addressed separately by disparate research groups:
- computer architects
- compiler writers
- the CAD/embedded systems community

Types of Architectures
(1) Cache: line size and associativity can be customized for a given application; access time is subject to cache misses.
(2) Scratch Pad Memory (SPM): data memory residing on-chip that is mapped into an address space disjoint from the off-chip memory, but connected to the same address and data buses; fast access (SRAM), single-cycle access time.
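A minimal C sketch of the SPM idea, assuming a memory-mapped SPM at a fixed base address (the address and size below are hypothetical; the real values come from the SoC memory map or the linker script):

#include <stdint.h>

#define SPM_BASE 0x20000000u   /* assumed SPM base address  */
#define SPM_SIZE 4096u         /* assumed SPM size in bytes */

/* Pointer into the SPM region: accesses through spm_buf hit
 * single-cycle on-chip SRAM and bypass the data cache. */
static volatile uint8_t *const spm_buf = (volatile uint8_t *)SPM_BASE;

void copy_to_spm(const uint8_t *src, uint32_t n)
{
    /* Explicit, compiler-visible transfer: unlike a cache,
     * nothing is evicted behind the program's back. */
    for (uint32_t i = 0; i < n && i < SPM_SIZE; i++)
        spm_buf[i] = src[i];
}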

[Figure slides: cache-based vs. SPM-based on-chip memory organizations]

Simple Example
- if source and mask were accessed through the data cache, performance would suffer from cache conflicts
- storing the small mask array in the SPM eliminates all conflicts in the data cache
- the data cache is used for the memory accesses to source, which are very regular
- storing mask on-chip ensures that frequently accessed data are never ejected off-chip, significantly improving memory performance and energy dissipation
[Figure: (a) Procedure CONV. (b) Memory access pattern in CONV.]
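A hedged C reconstruction of the CONV idea (array sizes, identifier names, and the SPM placement mechanism are illustrative, not taken from the original procedure):

#define N 512   /* image dimension (assumed) */
#define M 3     /* mask dimension (assumed)  */

/* spm_mask is assumed to live in the SPM (placed as sketched above);
 * the large source image streams through the data cache. */
void conv(const float src[N][N], float dst[N][N], const float spm_mask[M][M])
{
    for (int i = 0; i <= N - M; i++)
        for (int j = 0; j <= N - M; j++) {
            float acc = 0.0f;
            for (int u = 0; u < M; u++)
                for (int v = 0; v < M; v++)
                    /* spm_mask hits the SPM every time; src accesses are
                     * regular, so the cache handles them well, and src
                     * traffic can never evict the mask. */
                    acc += src[i + u][j + v] * spm_mask[u][v];
            dst[i][j] = acc;
        }
}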

Types of Architectures (cont.)
(3) DRAM: multiple (large) embedded memory banks. The access modes of synchronous DRAMs that must be modeled include:
- burst-mode read/write: fast successive accesses to data in the same page
- interleaved row read/write modes: alternating burst accesses between banks
- interleaved column access: alternating burst accesses between two chosen rows in different banks

Types of Architectures (cont.)
(4) Special-Purpose Memories:
- last-in, first-out (LIFO) memories are used in microcontrollers
- queues, or first-in, first-out (FIFO) memories, are used in network chips
- content-addressable memories (CAM) are used in search applications

Customization of Memory Architectures: Cache
Parameters: 1. cache line size; 2. cache size.
- if memory accesses are regular and consecutive (exhibit spatial locality), a longer cache line is desirable, since it minimizes the number of off-chip accesses and exploits the locality by prefetching elements that will be needed in the immediate future
- if the memory accesses are irregular, or have large strides, a shorter cache line is desirable, as this reduces off-chip memory traffic by not bringing unnecessary data into the cache
- the maximum size of a cache line is the DRAM page size
How to estimate?
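A small C illustration of why the best line size depends on the access pattern, assuming 4-byte floats and a 64-byte line: the unit-stride loop uses all 16 elements of each fetched line, while the column-order loop touches only 1 of 16:

#define N 1024

static float a[N][N];

float sum_unit_stride(void)   /* spatial locality: long lines win */
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];     /* consecutive addresses */
    return s;
}

float sum_large_stride(void)  /* strided: short lines win */
{
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];     /* stride of N*4 bytes per access */
    return s;
}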

Customization of Memory Architectures: SPM + Cache
The MemExplore framework optimizes the on-chip data memory organization, addressing the following problem: given a certain amount of on-chip memory space, partition it into data cache and SPM so that the total access time and energy dissipation are minimized, i.e., so that the number of accesses to off-chip memory is minimized.
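A sketch of what such an exploration loop could look like; the cost model below is hypothetical and merely stands in for MemExplore's published estimator:

#include <limits.h>

/* hypothetical cost model: more on-chip memory -> fewer off-chip accesses */
static unsigned long estimate_offchip(unsigned cache_bytes, unsigned spm_bytes)
{
    return 1000000ul / (cache_bytes + spm_bytes / 2 + 1);
}

void explore(unsigned budget, unsigned *best_cache, unsigned *best_spm)
{
    unsigned long best_cost = ULONG_MAX;
    /* try every power-of-two cache size within the on-chip budget */
    for (unsigned cache = 1024; cache <= budget; cache *= 2) {
        unsigned spm = budget - cache;   /* remainder goes to the SPM */
        unsigned long cost = estimate_offchip(cache, spm);
        if (cost < best_cost) {
            best_cost = cost;
            *best_cache = cache;
            *best_spm = spm;
        }
    }
}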

DRAM Optimization: Multiple Banks
Example with data sets larger than the cache line: since each bank has its own private page buffer, there is no interference between the arrays, and the memory accesses do not become a bottleneck.
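An illustrative C fragment, assuming two hypothetical bank base addresses: the two streamed arrays are placed in different banks, so each keeps its own row open in that bank's page buffer and alternating accesses cause no row-activate conflicts:

#include <stdint.h>

#define BANK0_BASE 0x80000000u   /* hypothetical base of DRAM bank 0 */
#define BANK1_BASE 0x81000000u   /* hypothetical base of DRAM bank 1 */

static volatile float *const x = (volatile float *)BANK0_BASE;
static volatile float *const y = (volatile float *)BANK1_BASE;

float dot(int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];   /* x and y hit different open rows */
    return s;
}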

Memory Reconfigurability
Idea: reconfigure the cache (or SPM) architecture dynamically according to the application at hand. The compiler can analyze a given application, divide its code into regions, and, for each region, select an optimum cache configuration for each processor.
Problems:
- architectural and circuit mechanisms for efficient and fast reconfiguration are essential
- control mechanisms for deciding when to reconfigure these caches are required
- mechanisms to determine the optimal configuration of the cache are needed
- techniques for minimizing the overhead of data invalidation across different reconfiguration phases are essential
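A sketch of what compiler-inserted, region-based reconfiguration could look like; cache_reconfigure() and the two configurations are hypothetical, not a real platform API:

typedef struct { unsigned size_kb, assoc, line_bytes; } cache_cfg;

/* stub for illustration; the real hook is platform-specific */
static void cache_reconfigure(const cache_cfg *cfg) { (void)cfg; }

static const cache_cfg cfg_region_a = { 32, 4, 64 };  /* regular loop nests */
static const cache_cfg cfg_region_b = {  8, 1, 32 };  /* irregular code     */

void run(void)
{
    cache_reconfigure(&cfg_region_a);
    /* ... region A: array-intensive loop nests ... */

    cache_reconfigure(&cfg_region_b);  /* switching may invalidate data:
                                          the overhead the text says must
                                          be minimized */
    /* ... region B: irregular, pointer-chasing code ... */
}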

COMPILER SUPPORT: Problems
o Parallelism
- the parallelization strategy determines how memory is utilized by the multiple on-chip processors and can be an important factor for achieving acceptable performance
- intrinsic data dependences
- interprocessor communication costs
o Instruction and Data Locality
- interprocessor communication can lead to frequent cache line invalidations/updates (interprocessor data sharing), which in turn increase overall latency
- false sharing (see the sketch below)
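A minimal C illustration of false sharing, assuming a 64-byte cache line: two per-processor counters in the same line force the line to ping-pong between the processors' caches even though no datum is actually shared; padding removes the interference:

struct bad {                 /* c0 and c1 share one line: false sharing */
    long c0, c1;
};

struct good {                /* one counter per (assumed) 64-byte line  */
    long c0; char pad0[64 - sizeof(long)];
    long c1; char pad1[64 - sizeof(long)];
};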

Problems (cont.)
o Power/Energy Consumption
- increasing the number of PEs means powering up more processors along with their local caches (and/or SPMs) → more power
- the compiler should be able to balance the increase in power consumption against the decrease in execution cycles
o Memory Space
- reducing memory space consumption can be critically important, as it increases the effectiveness of on-chip memory utilization and can reduce the number of off-chip references

COMPILER SUPPORT: Solutions
o Optimizing Parallelism
- parallelism can either be expressed by the programmer at the source level or be automatically derived by an optimizing compiler from the sequential code
- in the compiler case, the compiler needs to analyze data dependences and extract the available parallelism (see the example below)
- tasks: 1. estimate the performance, power consumption, and space requirements of a given piece of code; 2. optimize the objective function under multiple constraints
- it is easier to estimate performance and energy consumption with SPMs (as opposed to caches), since the compiler is in full control of memory transfers
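A small C example of what the dependence analysis must distinguish: a loop whose iterations are independent, and can thus be spread across the on-chip processors, versus one with a loop-carried dependence:

void doall(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)   /* independent iterations: parallelizable */
        a[i] = b[i] * 2.0f;
}

void serial(float *a, int n)
{
    for (int i = 1; i < n; i++)   /* a[i] needs a[i-1]: must stay sequential
                                     unless transformed first */
        a[i] = a[i - 1] + 1.0f;
}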

COMPILER SUPPORT: Solutions
o Optimizing Locality
- locality optimization can be performed for both code and data
- goal: reduce the number of accesses to the slower levels of the memory hierarchy
- optimization of the data structures (statically or dynamically): data space (memory layout) transformations, or data transformations for short
- example: software data prefetching, in three steps (sketched below): 1. determine the data references that are likely to be cache misses and therefore need to be prefetched; 2. isolate the predicted cache-miss instances through loop splitting; 3. apply software pipelining and insert explicit prefetch instructions into the code
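A minimal sketch of these three steps in C, using GCC's __builtin_prefetch; the prefetch distance of 16 elements is an assumption (a real compiler derives it from the memory latency and the loop body cost):

void scale(float *a, const float *b, int n)
{
    int i;
    int split = n > 16 ? n - 16 : 0;   /* loop splitting (step 2) */

    /* main loop: prefetch b[i+16] while computing on b[i] (step 3) */
    for (i = 0; i < split; i++) {
        __builtin_prefetch(&b[i + 16], 0 /* read */, 1 /* low temporal reuse */);
        a[i] = b[i] * 2.0f;
    }
    /* split-off epilogue: the last iterations, nothing left to prefetch */
    for (; i < n; i++)
        a[i] = b[i] * 2.0f;
}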

COMPILER SUPPORT: Solutions
o Optimizing Locality (cont.)
- from a multiprocessor perspective, the locality problem is more challenging to tackle than in the single-processor case
- the SPM approach is simple and preferable
- where an L1 cache is necessary → energy-aware coherence protocols/algorithms are an important potential research direction

COMPILER SUPPORT: Solutions
o Optimizing Memory Space Utilization
- consider the lifetimes of program data structures/variables
- this enables sharing of the same data space (see the sketch below)
- sharing data space may reduce performance because of the induced data sharing
- the compiler should be able to resolve this tradeoff, considering the performance and memory space constraints at the same time
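A small C illustration of lifetime-based space reuse: two temporaries with disjoint lifetimes share one scratch buffer instead of occupying two (sizes and phase computations are illustrative):

#define N 1024

static float scratch[N];   /* one buffer, two disjoint lifetimes */

void pipeline(const float *in, float *out)
{
    /* phase 1: scratch holds the first temporary */
    for (int i = 0; i < N; i++) scratch[i] = in[i] * in[i];
    float s = 0.0f;
    for (int i = 0; i < N; i++) s += scratch[i];   /* last use of temporary 1 */

    /* phase 2: temporary 1 is dead, so scratch is reused for temporary 2 */
    for (int i = 0; i < N; i++) scratch[i] = in[i] + s;
    for (int i = 0; i < N; i++) out[i] = scratch[i];
}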

COMPILER SUPPORT: Solutions
o Power/Energy Optimization
- power/energy and performance optimization are conflicting targets
- example: varying the number of PEs per application
- more PEs → better performance (results normalized to 1 PE)

COMPILER SUPPORT: Solutions
o Power/Energy Optimization (cont.)
- but the energy consumption increases

COMPILER SUPPORT: Solutions
o Power/Energy Optimization (cont.)
- most published work on parallelism for high-end machines is static; that is, the number of processors that execute the code is fixed for the entire execution
- alternative: adaptive parallelization
- it is possible to consume much less energy by using the minimum number of processors for each loop nest and shutting down the caches of the unused processors (sketched below)
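A sketch of adaptive parallelization under stated assumptions: set_active_cpus() and cache_shutdown() are hypothetical platform hooks, and the per-nest processor counts are illustrative:

/* stubs for illustration; the real hooks are platform-specific */
static void set_active_cpus(int n)  { (void)n; }
static void cache_shutdown(int cpu) { (void)cpu; }

void run_nests(int total_cpus)
{
    int best[2] = { 4, 1 };   /* processors chosen per loop nest (assumed) */

    for (int nest = 0; nest < 2; nest++) {
        set_active_cpus(best[nest]);
        for (int cpu = best[nest]; cpu < total_cpus; cpu++)
            cache_shutdown(cpu);   /* unused processors: caches shut down */
        /* ... execute loop nest `nest` on best[nest] processors ... */
    }
}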

Conclusions
With full advance knowledge of the applications being implemented by the system, many design parameters can be optimized and/or customized.
The optimal memory architecture for an application-specific system can be significantly different from the typical cache hierarchy of processors.

Reference: http://www.artist-embedded.org/docs/events/2010/autrans/talks/pdf/teich/autrans2010_teich_trr89.ppt.pdf