Memory-Link Compression Schemes: A Value Locality Perspective


Memory-Link Compression Schemes: A Value Locality Perspective. Martin Thuresson, Lawrence Spracklen and Per Stenström (IEEE). Presented by Jean Niklas L'orange and Caroline Sæhle for TDT01, Norwegian University of Science and Technology, November 11, 2013.

Introduction

What is the problem exactly? There are limits to dealing with the processor-memory gap on-chip:
- Diminishing returns from making deeper cache hierarchies
- Limited bandwidth is already a problem for memory-bound applications, and it is getting worse:
  - Pin count goes up less than 10% as transistor count doubles
  - The transition to multicore processing increases bandwidth usage
  - Control speculation, hardware scouting, and value prediction increase bandwidth usage by 15-30%
- Latency-compensating techniques come at the expense of bandwidth

Solution: reduce the bandwidth needed!

Introduction: Memory-Link Compression

- Compress data before it is sent over the link between the last-level cache and memory
- Decompress data before it is installed in the cache or written back to memory
- Storing compressed data in memory is also a possibility, but is not investigated in the paper
- Additional advantage: reduces transfer time and miss penalty
- Disadvantage: increases transfer latency, so lightweight compression is needed

(Diagram: Cache — Comp/Decomp — Link — Comp/Decomp — Main Memory)

Introduction: Memory-Link Compression, cont'd

Lightweight compression schemes:
- Significance-width compression (SWC) exploits small value locality
- Delta encoding exploits clustered value locality
- The Citron scheme and the frequent value encoding (FVE) scheme exploit isolated value locality

Which are relevant for the different domains: integer, multimedia, and commercial applications? Does a combination of schemes work better?

Value Locality

So what is value locality? "A program attribute that describes the likelihood of the recurrence of previously-seen program values" [1]. The paper analyses the locality and compressibility of memory-link traffic using the Simics full-system simulator. Results are fairly consistent but not uniform: a large number of transferred values are either very small or very large.

[1] Lepak, K.M.; Lipasti, M.H., "On the value locality of store instructions", 2000

Value Locality: Small Value Locality

Significance-width compression (SWC) utilises small value locality: many values are small. Each value is encoded in two parts:
- An integer x of fixed width, giving the number of remaining bits
- x bits, representing the actual value

A fast and simple approach, extremely parallelizable. Good compression rates for small values, but significant overhead for large values.
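The two-part encoding above can be sketched as follows. This is our illustrative model of SWC's link cost, not the paper's hardware encoder; `WIDTH_BITS` and the function names are assumptions.

```python
# A minimal sketch of SWC, assuming a 5-bit fixed-width length field
# (the configuration the slides report as working well).

WIDTH_BITS = 5  # fixed-width field: number of significant bits that follow

def swc_encode(value):
    """Split a non-negative word into (significant-bit count, payload)."""
    sig = max(value.bit_length(), 1)  # at least one bit, so zero is encodable
    return sig, value                 # transmitted as WIDTH_BITS + sig bits

def swc_bits(value):
    """Total bits on the link for one word under SWC."""
    sig, _ = swc_encode(value)
    return WIDTH_BITS + sig
```

Under this model a small value like 3 costs 7 bits instead of 32, while a full 32-bit value pays the 5-bit overhead and costs 37 bits, illustrating the trade-off stated on the slide.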

Value Locality: SWC Compression Results

Frees up 30% bandwidth on average, with 5 bits representing the number of remaining bits. Different binning schemes were tried, but worked well only for integer applications.

Value Locality: Clustered Value Locality

Many values are close to each other, even if they are large. Delta encoding utilises this property:
- Keep multiple cluster values in a cache; pick the closest one found
- Send the index of the cluster value over the link, along with the difference from the actual value
- If the difference exceeds a threshold, insert the current value into the cache (LRU) and send the raw value over the wire

A larger cache means more index bits; a larger threshold allows larger differences. What are the optimal values?

(Diagram: a Δ-cache on each side of the link between the cache and main memory)
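The steps above can be sketched as a toy encoder. The class, its parameter names, and the exact field widths (a hit/miss flag, log2-sized index, sign bit on the delta) are our assumptions for illustration; the paper's wire format may differ.

```python
from collections import OrderedDict

class DeltaLinkEncoder:
    """Toy delta encoder: keep a small LRU cache of cluster values; on a
    close match send (index, delta), otherwise send the raw word and
    insert it into the cache."""

    def __init__(self, size=32, threshold_bits=16, word_bits=32):
        self.cache = OrderedDict()               # cluster values, LRU order
        self.size = size
        self.threshold = 1 << threshold_bits
        self.word_bits = word_bits
        self.index_bits = size.bit_length() - 1  # log2(size), size a power of 2

    def encode(self, value):
        """Return (hit, bits_on_link) for one transferred word."""
        if self.cache:
            base = min(self.cache, key=lambda b: abs(value - b))
            if abs(value - base) < self.threshold:
                self.cache.move_to_end(base)     # LRU touch
                delta_bits = max(abs(value - base).bit_length(), 1) + 1  # + sign
                return True, 1 + self.index_bits + delta_bits  # 1 = hit flag
        # Miss: insert the value (evicting the LRU entry) and send it raw.
        self.cache[value] = None
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)
        return False, 1 + self.word_bits
```

With the slide's parameters (32-entry cache, 16-bit threshold), a word close to a cached cluster value costs roughly a flag, a 5-bit index, and a short delta instead of a full 32-bit word.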

Value Locality: Delta Encoding Compression Results

Value Locality: Delta Encoding Compression Results

Very good compression rate: on average 12 bits for integer, 14 for media, and 20 for commercial applications, yielding a 60% compression rate on average. Setting the threshold to 16 and the cache size to 32 gave optimal results on average. No sensible results for commercial programs, presumably because many data ranges are live at any given time.

Value Locality: Isolated Value Locality

Programs tend to have many frequently recurring values, for example 0 and 1. We can utilise this to avoid sending the same value over and over again. There are two schemes to handle this:
1. The Frequent Value Encoding (FVE) scheme: store frequent values in a cache
2. The Citron scheme: an FVE scheme applied to the 16 most significant bits

Value Locality: The FVE and Citron Schemes

Frequent Value Encoding (FVE) scheme:
- Keep a cache of the most recently used values (LRU replacement)
- If the value is in the cache, send only its index over the wire
- Otherwise, update the cache (LRU) and send the whole value over

Citron scheme:
- Split the value into two 16-bit halves
- Perform the FVE scheme on the 16 most significant bits
- Send the 16 least significant bits verbatim

Again, the question is the optimal cache size for both schemes.
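Both schemes above can be sketched with one LRU value cache; the sender and receiver update their caches identically so the index is meaningful on both sides. The class and the 1-bit hit/miss flag are illustrative assumptions.

```python
from collections import OrderedDict

class FVEncoder:
    """Toy FVE: an LRU value cache. A hit sends a flag plus an index; a
    miss sends a flag plus the raw value and updates the cache."""

    def __init__(self, size=32, word_bits=32):
        self.cache = OrderedDict()
        self.size, self.word_bits = size, word_bits
        self.index_bits = size.bit_length() - 1  # log2(size), size a power of 2

    def encode(self, value):
        if value in self.cache:                  # hit: index only
            self.cache.move_to_end(value)
            return True, 1 + self.index_bits
        self.cache[value] = None                 # miss: insert, evict LRU
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)
        return False, 1 + self.word_bits

def citron_bits(upper_fve, value):
    """Citron: FVE on the upper 16 bits, lower 16 bits sent verbatim."""
    _, bits = upper_fve.encode(value >> 16)
    return bits + 16
```

With a 32-entry cache, an FVE hit costs 6 bits and a Citron upper-half hit costs 22 bits (6 + the verbatim lower half), consistent with FVE compressing further than Citron when values repeat exactly.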

Value Locality: FVE Scheme Compression Results

Value Locality: Citron Scheme Compression Results

Value Locality: FVE and Citron Compression Results

The optimal cache size for both seems to be 32 entries. Citron compresses values down to 20 bits on average, around a 40% reduction. FVE manages half of that: 10 bits, almost a 70% reduction. The miss component is quite big, even with a large cache.

Combining Value Locality Properties

None of these compression schemes handles more than one type of locality. Why not try to combine them?

Combining Value Locality Properties: Small and Clustered Value Locality

There is some inefficiency in delta encoding: the delta values to be transferred are usually small. Using SWC, we can compress the delta value before sending it over the memory link. The combined compression is quite effective, especially for integer and media applications. For commercial applications, the combined gain is better than the separate gains, but not as spectacular.

(Diagram: Cache — Delta — SWC — Link)
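A rough per-word cost model for the combined scheme is sketched below: on a cluster-cache hit the delta is SWC-encoded instead of sent at a fixed width. The field widths are illustrative assumptions (5-bit SWC length field, 32-entry delta cache).

```python
# Illustrative link cost for a cluster-cache hit when the delta is
# SWC-encoded; field widths are assumptions, not the paper's exact format.

SWC_LEN_BITS = 5   # SWC length field
INDEX_BITS = 5     # index into a 32-entry cluster-value cache

def delta_swc_bits(delta):
    """Bits on the link for a cluster-cache hit with the given signed delta."""
    sig = max(abs(delta).bit_length(), 1) + 1  # magnitude bits + sign bit
    return INDEX_BITS + SWC_LEN_BITS + sig
```

For a typical small delta of 10 this costs 15 bits, versus 32 for a raw word, which is where the combined scheme's extra gain over plain delta encoding comes from.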

Combining Value Locality Properties: Small and Clustered Value Locality Compression Results

Combining Value Locality Properties: Small and Isolated Value Locality

The Citron and FVE schemes are quite effective at reducing bandwidth usage, but transferring the data on a miss in the value cache is inefficient. Using SWC, we can handle this more efficiently:
- When updating the value cache on a miss, the new value can be sent SWC-encoded
- Small values, defined as those representable in 16 bits or less, don't need to be stored in the value cache, as SWC is already very efficient for them
- The small and large values can thus be compressed in parallel, which may give low latency

(Diagram: Cache — SWC — FVE — Link)
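The split described above can be sketched as one encoder with two paths: a 1-bit path flag selects SWC for small values and the FVE value cache for large ones. The class name, flag, and field widths are our illustrative assumptions.

```python
from collections import OrderedDict

SWC_LEN_BITS = 5  # SWC length field

class SwcFveEncoder:
    """Combined SWC+FVE sketch: values fitting in `small_bits` bypass the
    value cache and are SWC-encoded; larger values use an LRU value cache
    (index on a hit, raw word on a miss)."""

    def __init__(self, size=32, word_bits=32, small_bits=16):
        self.cache = OrderedDict()
        self.size, self.word_bits, self.small_bits = size, word_bits, small_bits
        self.index_bits = size.bit_length() - 1

    def bits(self, value):
        """Link cost in bits for one transferred word."""
        if value.bit_length() <= self.small_bits:                 # SWC path
            return 1 + SWC_LEN_BITS + max(value.bit_length(), 1)
        if value in self.cache:                                   # FVE hit
            self.cache.move_to_end(value)
            return 1 + 1 + self.index_bits
        self.cache[value] = None                                  # FVE miss
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)
        return 1 + 1 + self.word_bits
```

Keeping small values out of the value cache leaves its entries for the large values that actually need them, which is the intuition behind the combined scheme's gain for integer and commercial applications.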

Combining Value Locality Properties: Small and Isolated Value Locality Compression Results


Combining Value Locality Properties: Small and Isolated Value Locality, cont'd

Combining SWC and FVE increases compressibility for integer and commercial applications, but is not helpful for media applications. For media applications, values are either zero or need all 32 bits to be represented, so SWC brings no benefit, only added overhead. For the Citron scheme, combining it with SWC gives better results than either individually; this benefit comes primarily from compressing the 16 least significant bits.

Conclusion

- Identified three categories of value locality: small, clustered, and isolated
- Measured previously proposed techniques by their compressibility, using a consistent framework
- As the previous schemes each targeted only one of the three categories, the authors proposed two new schemes by combining the previous techniques
- Achieved a 70-75% bandwidth reduction, compared to the previous 35-60%

Conclusion: Discussion

- While the bandwidth reduction is good, there are no performance measures: what are the speedups? Are there any energy savings?
- Though not quantified, the architecture will evidently be more complex
- Increasing the performance of this new compression scheme seems hard
- The commercial multicore programs show worse bandwidth reduction than the single-core programs; compression quality may degrade with more cores

Questions?