Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors
Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Computer Systems Laboratory, Stanford University
ISCA 2007

Cores are the New GHz
[Diagram: tiled chip multiprocessor with processor (P) and memory (M) blocks]
- 90s: GHz & ILP. Problems: power, complexity, ILP limits
- 00s: cores. Multicore, manycore, ...

What is the New Memory System?
[Diagram: CMP tile with per-core local memory; two options shown]
- Cache-based Memory: tags, data array, cache controller
- Streaming Memory: local storage, DMA engine

The Role of Local Memory
[Diagram: cache-based memory (tags, data array, cache controller) vs. streaming memory (local storage, DMA engine)]
- Exploit spatial & temporal locality
- Reduce average memory access time: enable data re-use, amortize latency over several accesses
- Minimize off-chip bandwidth: keep useful data local
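
As a back-of-the-envelope illustration of the second bullet (standard cache arithmetic, not a number from this talk): average memory access time = hit time + miss rate × miss penalty. With a 1-cycle hit, a 5% miss rate, and a 100-cycle miss penalty, that is 1 + 0.05 × 100 = 6 cycles; data re-use that halves the miss rate brings it down to 3.5 cycles.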

Who Manages Local Memory?

  Locality        Cache-based        Streaming
  Data Fetch      Reactive           Proactive
  Placement       Limited mapping    Arbitrary
  Replacement     Fixed policy       Arbitrary
  Granularity     Cache block        Arbitrary
  Communication   Coherence (HW)     Software

- Cache-based: hardware-managed
- Streaming: software-managed
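
To make the hardware/software split concrete, here is a minimal C sketch. The dma_get()/dma_wait() calls and the scratchpad array are hypothetical stand-ins for the DMA library the talk mentions later, not the authors' actual API:

    #include <stddef.h>

    /* Hypothetical DMA interface: a tag names a transfer so software
     * can later wait for it. */
    extern void dma_get(int tag, void *dst, const void *src, size_t bytes);
    extern void dma_wait(int tag);

    #define N 1024
    extern float in[N];                /* input data in off-chip DRAM */

    /* Cache-based: hardware manages local memory. A plain load triggers
     * fetch, placement, and replacement reactively. */
    float sum_cached(void) {
        float s = 0.0f;
        for (size_t i = 0; i < N; i++)
            s += in[i];                /* misses handled by the cache */
        return s;
    }

    /* Streaming: software manages local memory. Data must be moved into
     * the scratchpad proactively, at whatever granularity fits. */
    static float scratch[N];           /* software-managed local storage */

    float sum_streamed(void) {
        dma_get(0, scratch, in, sizeof scratch);  /* proactive fetch */
        dma_wait(0);                              /* transfer complete */
        float s = 0.0f;
        for (size_t i = 0; i < N; i++)
            s += scratch[i];           /* guaranteed-local access */
        return s;
    }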

Potential Advantages of Streaming Memory
- Better latency hiding
  - Overlap DMA transfers with computation
  - Double buffering is macroscopic prefetching
- Lower off-chip bandwidth requirements
  - Avoid conflict misses
  - Avoid superfluous refills for output data
  - Avoid write-back of dead data
  - Avoid fetching whole lines for sparse accesses
- Better energy and area efficiency
  - No tag & associativity overhead
  - Fewer off-chip accesses

How Much Advantage over Caching?
- How do they differ in performance?
- How do they differ in scaling?
- How do they differ in energy efficiency?
- How do they differ in programmability?

Our Contribution: A Head-to-Head Comparison
- Cache-based Memory vs. Streaming Memory
- Unified set of constraints:
  - Same processor core
  - Same capacity of local storage per core
  - Same on-chip interconnect
  - Same off-chip memory channel
- Justification:
  - VLSI constraints (e.g., local storage capacity)
  - No fundamental differences (e.g., core type)

Our Conclusions
- Caching performs & scales as well as Streaming
  - Well-known cache enhancements eliminate the differences
- Stream programming benefits Caching Memory
  - Enhances locality patterns
  - Improves the bandwidth and efficiency of caches
- Stream programming is easier with Caches
  - Makes the memory system amenable to irregular & unpredictable workloads
- Streaming Memory is likely to be replaced, or at least augmented, by Caching Memory

Simulation Parameters
- 1-16 cores: Tensilica LX, 3-way VLIW, 2 FPUs
- Clock frequency: 800 MHz to 3.2 GHz
- On-chip data memory
  - Cache-based: 32 kB cache, 32 B blocks, 2-way, MESI
  - Streaming: 24 kB scratchpad + DMA engine, plus 8 kB cache (32 B blocks, 2-way)
  - Both: 512 kB L2 cache, 32 B blocks, 16-way
- System
  - Hierarchical on-chip interconnect
  - Simple main memory model (3.2 GB/s to 12.8 GB/s)

Benchmark Applications
- There is no "SPEC Streaming" suite: few available apps with both streaming & caching versions
- Selected 10 streaming applications
  - Some previously used to motivate or evaluate Streaming Memory
- Co-developed apps for both systems
  - Caching: C, threads
  - Streaming: C, threads, DMA library
- Optimized both versions as best we could

Benchmark Applications
- Video processing: Stereo Depth Extraction, H.264 Encoding, MPEG-2 Encoding
- Image processing: JPEG Encode/Decode, KD-tree Raytracer, 179.art
- Scientific and data-intensive: 2D Finite Element Method, 1D Finite Impulse Response, Merge Sort, Bitonic Sort
- Some of these are unpredictable or irregular

Our Conclusions
- Caching performs & scales as well as Streaming
  - Well-known cache enhancements eliminate the differences
- Stream programming benefits Caching Memory
  - Enhances locality patterns
  - Improves the bandwidth and efficiency of caches
- Stream programming is easier with Caches
  - Makes the memory system amenable to irregular & unpredictable workloads
- Streaming Memory is likely to be replaced, or at least augmented, by Caching Memory

Parallelism Independent of Memory System
[Charts: normalized execution time for MPEG-2 Encoder and FEM at 3.2 GHz, for 1, 2, 4, 8, and 16 CPUs, cache vs. streaming; both systems reach roughly 12.4x and 13.8x speedup at 16 CPUs]
- 6/10 apps are little affected by the local memory choice

Local Memory Not Critical for Compute-Intensive Applications
[Chart: normalized execution time (Useful/Data/Sync) at 16 cores @ 3.2 GHz, cache vs. stream, for MPEG-2 and FEM]
- Intuition:
  - Apps limited by compute
  - Good data reuse, even with large datasets
  - Low misses/instruction
- Note: Sync includes barriers and DMA wait

Double-Buffering Hides Latency for Streaming Memory Systems
[Chart: normalized execution time (Useful/Data/Sync) at 16 cores @ 3.2 GHz, 12.8 GB/s, cache vs. stream, for FIR]
- Intuition:
  - Non-local accesses entirely overlapped with computation
  - DMAs perform efficient SW prefetching
- Note: this is the case for memory-intensive apps not bound by memory BW (FIR, 179.art, Merge Sort)
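
A minimal sketch of the double-buffering pattern behind this result, reusing the same hypothetical dma_get()/dma_put()/dma_wait(tag) interface as the earlier sketch (compute() is a placeholder kernel):

    #include <stddef.h>

    extern void dma_get(int tag, void *dst, const void *src, size_t bytes);
    extern void dma_put(int tag, void *dst, const void *src, size_t bytes);
    extern void dma_wait(int tag);
    extern void compute(float *data, size_t n);    /* placeholder kernel */

    #define CHUNK 256                  /* elements per DMA transfer */
    static float buf[2][CHUNK];        /* two scratchpad buffers */

    void process_stream(const float *src, float *dst, size_t nchunks) {
        dma_get(0, buf[0], src, sizeof buf[0]);        /* prime buffer 0 */
        for (size_t i = 0; i < nchunks; i++) {
            int cur = (int)(i & 1), nxt = cur ^ 1;
            if (i + 1 < nchunks) {
                dma_wait(nxt);                         /* prior put on buf[nxt] done */
                dma_get(nxt, buf[nxt],                 /* fetch next chunk early:   */
                        src + (i + 1) * CHUNK,         /* this transfer overlaps    */
                        sizeof buf[nxt]);              /* the compute below         */
            }
            dma_wait(cur);                             /* current chunk has arrived */
            compute(buf[cur], CHUNK);
            dma_put(cur, dst + i * CHUNK, buf[cur], sizeof buf[cur]);
        }
        dma_wait(0); dma_wait(1);                      /* drain outstanding puts */
    }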

Prefetching Hides Latency for Cache-Based Memory Systems
[Chart: normalized execution time (Useful/Data/Sync) at 16 cores @ 3.2 GHz, 12.8 GB/s, for FIR: cache, cache + prefetch, stream]
- Intuition:
  - A HW stream prefetcher overlaps misses with computation just as well
  - Works for predictable & regular access patterns
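
The HW stream prefetcher on the slide requires no source changes. For illustration only, a software analogue of the same idea using GCC/Clang's __builtin_prefetch (not anything from the paper):

    #include <stddef.h>

    /* FIR-like inner loop: issue a prefetch several iterations ahead so
     * the miss latency overlaps the multiply-accumulate work. */
    void fir(const float *x, const float *h, float *y, size_t n, size_t taps) {
        for (size_t i = 0; i + taps <= n; i++) {
            __builtin_prefetch(&x[i + 64], 0, 0);  /* ~8 32-byte lines ahead, read-only */
            float acc = 0.0f;
            for (size_t t = 0; t < taps; t++)
                acc += x[i + t] * h[t];
            y[i] = acc;
        }
    }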

Streaming Memory Often Incurs Less Off-Chip Traffic
[Chart: normalized off-chip traffic (read/write), cache vs. stream, for FIR, Merge Sort, and MPEG-2]
- The case for apps with large output streams
  - Streaming avoids superfluous refills for output streams
  - Write-allocate, fetch-on-miss caches cannot avoid them

SW-Guided Cache Policies Improve Bandwidth Efficiency
[Chart: normalized off-chip traffic (read/write) for FIR, Merge Sort, and MPEG-2: cache, cache + prepare-for-store (PFS), stream]
- Our system: "Prepare For Store" cache hint
  - Allocates the cache line but avoids the refill of old data
- Xbox 360: write buffer for non-allocating writes
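
A sketch of how such a hint would be used. prepare_for_store() is a hypothetical intrinsic standing in for the cache-hint instruction the slide refers to, and transform() is a placeholder kernel; the 32 B line size matches the simulation setup:

    #include <stddef.h>

    #define LINE 32                      /* cache line size from the setup */

    /* Hypothetical intrinsic: allocate the line in the cache but skip
     * fetching its old contents from DRAM. */
    extern void prepare_for_store(void *line);
    extern unsigned char transform(unsigned char);   /* placeholder kernel */

    /* Every line of 'out' is completely overwritten, so refilling it from
     * DRAM first is pure wasted bandwidth; hint each line before storing. */
    void emit(unsigned char *out, const unsigned char *in, size_t bytes) {
        for (size_t off = 0; off < bytes; off += LINE) {
            prepare_for_store(out + off);            /* no superfluous refill */
            for (size_t b = 0; b < LINE && off + b < bytes; b++)
                out[off + b] = transform(in[off + b]);
        }
    }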

Energy Efficiency Does Not Depend on Local Memory
[Chart: normalized energy at 16 cores @ 800 MHz, broken into Core/I-cache/D-cache/L-store/L2-cache/DRAM, cache vs. stream, for MPEG-2, FEM, and FIR]
- Intuition:
  - Energy is dominated by DRAM accesses and the processor core
  - A local store has ~2x the energy efficiency of a cache, but is a small portion of total energy
- Note: this is the case for compute-intensive applications

Optimized Bandwidth Yields Optimized Energy Efficiency
[Charts for FIR: normalized off-chip traffic (read/write) and normalized energy (Core/I-cache/D-cache/L-store/L2-cache/DRAM) for cache, PFS, and stream]
- Superfluous off-chip accesses are expensive!
- Streaming & SW-guided caching reduce them

Our Conclusions
- Caching performs & scales as well as Streaming
  - Well-known cache enhancements eliminate the differences
- Stream programming benefits Caching Memory
  - Enhances locality patterns
  - Improves the bandwidth and efficiency of caches
- Stream programming is easier with Caches
  - Makes the memory system amenable to irregular & unpredictable workloads
- Streaming Memory is likely to be replaced, or at least augmented, by Caching Memory

Stream Programming for Caches: MPEG-2 Example
[Diagram: P -> predicted video frame -> T]
- MPEG-2 example:
  - P() generates a video frame later consumed by T()
  - The whole frame is too large to fit in local memory
  - No temporal locality
- Opportunity: computations on frame blocks are independent

Stream Programming for Caches: MPEG-2 Example
[Diagram: P -> predicted block -> T, iterating over the frame]
- Introducing temporal locality:
  - Loop fusion for P() and T() at block level
  - Intermediate data are dead once T() is done

Stream Programming for Caches: MPEG-2 Example
[Diagram: P -> predicted block -> T, reusing one block buffer]
- Exploiting producer-consumer locality:
  - Re-use the predicted block buffer
  - Dynamic working set reduced: fits in local memory; no off-chip traffic (see the sketch below)
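
In code, the transformation on the last three slides looks roughly like this. P_block()/T_block(), the block count, and the block size are illustrative placeholders, not the paper's source:

    extern void P_block(unsigned char *out, int b);        /* predict one block */
    extern void T_block(const unsigned char *in, int b);   /* transform one block */

    #define BLOCKS   1620          /* e.g., 16x16 macroblocks in a 720x576 frame */
    #define BLOCK_SZ 256           /* pixels per block */

    /* Before: P() writes a whole predicted frame, T() reads it back.
     * The frame exceeds local memory, so every block round-trips DRAM. */
    void encode_unfused(unsigned char *frame) {
        for (int b = 0; b < BLOCKS; b++) P_block(&frame[b * BLOCK_SZ], b);
        for (int b = 0; b < BLOCKS; b++) T_block(&frame[b * BLOCK_SZ], b);
    }

    /* After: fuse the loops at block granularity and reuse one small
     * buffer. The producer-consumer data stays in local memory and is
     * dead once T_block() finishes, so it never goes off-chip. */
    void encode_fused(void) {
        unsigned char block[BLOCK_SZ];
        for (int b = 0; b < BLOCKS; b++) {
            P_block(block, b);     /* produce into the local buffer */
            T_block(block, b);     /* consume immediately */
        }
    }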

Stream Programming for Caches: MPEG-2 Example
[Chart: normalized off-chip traffic (read/write) at 2 cores for MPEG-2: unoptimized cache, optimized cache, stream]
- Stream programming is beneficial for any memory system
  - Exposes locality that improves the bandwidth and energy efficiency of local memory
  - Stream programming toolchains are helpful

Our Conclusions
- Caching performs & scales as well as Streaming
  - Well-known cache enhancements eliminate the differences
- Stream programming benefits Caching Memory
  - Enhances locality patterns
  - Improves the bandwidth and efficiency of caches
- Stream programming is easier with Caches
  - Makes the memory system amenable to irregular & unpredictable workloads
- Streaming Memory is likely to be replaced, or at least augmented, by Caching Memory

Stream Programming Is Easier with Caches
- Stream programming is necessary for correctness on Streaming Memory
  - Must refactor all dataflow
- Caches can use stream programming for performance, not correctness
  - Incremental tuning
  - Doesn't require up-front holistic analysis
- Why is this important? Many streaming apps include some unpredictable patterns

Specific Examples
- Raytracing: unpredictable tree accesses
  - Software caching on Cell (Benthin '06): emulation overhead, DMA latency for refills
  - Tree accesses have good locality on HW caches
- 3-D shading: unpredictable texture accesses
  - Texture accesses have good locality on HW caches
  - Caches are ubiquitous on GPUs

Our Conclusions
- Caching performs & scales as well as Streaming
  - Well-known cache enhancements eliminate the differences
- Stream programming benefits Caching Memory
  - Enhances locality patterns
  - Improves the bandwidth and efficiency of caches
- Stream programming is easier with Caches
  - Makes the memory system amenable to irregular & unpredictable workloads
- Streaming Memory is likely to be replaced, or at least augmented, by Caching Memory

Limitations of This Work
- Did not scale beyond 16 cores: does cache coherence scale?
- Application scope: may not generalize to other domains; general-purpose != application-specific
- Sensitivity to local storage capacity: intractable without language/compiler support

Future Work
- Scale beyond 16 cores
  - Exploit streaming SW to assist HW coherence
- Extend application scope
  - Generalize to other domains
  - Consider further optimizations
- Study sensitivity to local storage capacity
  - Introduce language/compiler support

Thank you! Questions?

Streaming Memory and L2 Caches
[Chart: normalized off-chip traffic (read/write) at 16 cores without the 512 kB L2, cache vs. stream, for FEM and MPEG-2]
- L2 caches mitigate overfetch in streaming apps
  - FEM: unstructured meshes
  - MPEG-2: motion estimation (search window over reference frames)

Streaming Memory Occasionally Consumes More Bandwidth
[Chart: normalized off-chip traffic (read/write) for Bitonic Sort, cache vs. stream]
- The problem: data-dependent write pattern
- Caching: automatically tracks modified state; writes back only dirty data
- Streaming: writes back everything; it is a programming burden & overhead to track modified state in software (see the sketch below)
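
A sketch of the burden the last bullet describes, reusing the same hypothetical DMA interface as the earlier sketches; the dirty flag is illustrative:

    #include <stddef.h>

    extern void dma_put(int tag, void *dst, const void *src, size_t bytes);
    extern void dma_wait(int tag);

    #define CHUNK 256
    static float chunk[CHUNK];       /* scratchpad-resident chunk */
    static int   chunk_dirty;        /* software dirty flag, set on modify */

    /* Caches track modified state per line in hardware. A streaming
     * system that wants to write back only modified data must keep
     * this bookkeeping itself. */
    void writeback_if_dirty(float *dst) {
        if (chunk_dirty) {           /* only modified data goes off-chip */
            dma_put(0, dst, chunk, sizeof chunk);
            dma_wait(0);
            chunk_dirty = 0;
        }
        /* Without this bookkeeping the DMA writes back the whole buffer,
         * dirty or not: the extra write traffic in the chart above. */
    }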