Evaluation of a High Performance Code Compression Method

Evaluation of a High Performance Code Compression Method

Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor

MICRO-32, November 16-18, 1999

Motivation

Problem: embedded code size
- Constraints: cost, area, and power
- The program must fit in on-chip memory
- Compilers vs. hand-coded assembly: compiled code wins on portability and development cost, but suffers code bloat

Solution: code compression
- Reduce compiled code size by taking advantage of instruction repetition
- Lets systems use cheaper processors with smaller on-chip memories
- Questions to evaluate: implementation, code size, and execution speed

[Diagram: two embedded systems (CPU, I/O, RAM, ROM), one storing the original program and one storing a compressed program in ROM]

Overview

CodePack
- IBM PowerPC instruction set; the first system with instruction-stream compression
- 60% compression ratio, performance within ±10% of native [IBM]; the performance gain comes from prefetching

Implementation
- Binary executables are compressed after compilation
- Compression dictionaries are tuned to the application
- Decompression occurs on an L1 cache miss; the L1 caches hold decompressed instructions
- Two cache lines (16 instructions) are decompressed at a time
- The PowerPC core is unaware of compression

CodePack encoding

A 32-bit instruction is split into two 16-bit halfwords, and each halfword is compressed separately against its own dictionary:

    Upper 16 bits                   Lower 16 bits
    Entries   Code                  Entries   Code
       8      00  xxx                  1      00  (encodes zero)
      32      01  xxxxx               16      01  xxxx
      64      100 xxxxxx              23      100 xxxxx
     128      101 xxxxxxxx           128      101 xxxxxxxx
     256      110 xxxxxxxxx          256      110 xxxxxxxxx
    escape    111 + 16 raw bits     escape    111 + 16 raw bits

Each code is a tag followed by a dictionary index; the 111 tag is an escape followed by the raw, uncompressed bits.
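To make the tag/index scheme concrete, here is a minimal Python sketch of encoding one halfword against a frequency-ordered dictionary. The bucket sizes and tags follow the upper-16-bit table above; the dictionary contents are hypothetical, since real CodePack dictionaries are tuned to each application.

    # (tag bits, index bits, dictionary entries served by this bucket)
    BUCKETS = [
        ("00",  3,   8),
        ("01",  5,  32),
        ("100", 6,  64),
        ("101", 8, 128),
        ("110", 9, 256),
    ]
    ESCAPE = "111"  # escape tag: the raw 16 bits follow uncompressed

    def encode_halfword(halfword, dictionary):
        """Encode one 16-bit halfword as a bit string. `dictionary`
        lists halfwords from most to least frequent; the position
        decides which tag/index bucket applies."""
        if halfword in dictionary:
            rank = dictionary.index(halfword)
            base = 0
            for tag, index_bits, entries in BUCKETS:
                if rank < base + entries:
                    return tag + format(rank - base, "0%db" % index_bits)
                base += entries
        return ESCAPE + format(halfword, "016b")

    dictionary = [0x3821, 0x0000, 0x7C08]       # hypothetical, frequency-ordered
    print(encode_halfword(0x0000, dictionary))  # '00001': 5 bits
    print(encode_halfword(0xBEEF, dictionary))  # '111' + 16 raw bits: 19 bits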

CodePack decompression

1. Fetch index: on an L1 I-cache miss, the 20-bit compression-block number taken from the miss address (bits 6-25, above the 6-bit block offset) selects an entry in the index table, which resides in main memory.
2. Fetch compressed instructions: the index-table entry gives the byte-aligned address of the compressed bytes for one compression block (16 instructions), also in main memory.
3. Decompress: each compressed instruction consists of a high tag/index and a low tag/index; these are looked up in the high and low dictionaries, and the two 16-bit halves are concatenated into the native instruction.

[Diagram: decompression data path from the miss address through the index table, compressed bytes, and high/low dictionaries to the native instruction]
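As a rough illustration of the first step, the sketch below maps a miss address to a compressed-block address. A compression block covers 16 four-byte instructions (64 bytes), so the 20 bits above the 6-bit block offset select the index-table entry; the table contents here are hypothetical.

    def compressed_block_addr(miss_addr, index_table):
        block_num = (miss_addr >> 6) & 0xFFFFF  # 20-bit compression-block number
        return index_table[block_num]           # byte-aligned address of the codes

    index_table = {0: 0x0000, 1: 0x002B, 2: 0x0051}  # hypothetical entries
    # A miss 20 bytes into the second compression block maps to entry 1:
    print(hex(compressed_block_addr(0x0040 + 20, index_table)))  # 0x2b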

Compression ratio

    compression ratio = compressed size / original size

Lower is better: for example, compressing a 100 KB executable to 62 KB gives a 62% ratio. Across the benchmarks (applu, apsi, cc1, compress95, fpppp, go, hydro2d, ijpeg, li95, m88ksim, mgrid, mpeg2enc, pegwit, perl, su2cor, swim, tomcatv, turb3d, vortex, wave5) the average is 62%.

[Chart: compression ratio per benchmark, 0%-100%]

CodePack programs

Composition of the compressed go executable:
- Indices 51% and tags 25%: the compressed codes themselves
- Raw bits 14% plus escape tags 3%: across benchmarks, 17%-25% of a compressed executable is not compressed at all (including escape bits); compiler optimizations might help here
- Index table 5%
- Dictionary 1% (a fixed 2 KB cost)
- Pad bits 1%

I-cache miss timing

- Native code uses critical-word-first fetch; compressed code must be fetched sequentially.
- The example shows a miss to the 5th instruction in a cache line, with 32-bit instructions on a 64-bit bus:
  a) Native code: the missed instructions come straight from main memory, critical word first.
  b) Compressed code: fetch the index from main memory, then the codes from main memory, then run them through the decompressor.
  c) Compressed code + optimizations: fetch the index from the index cache, the codes from main memory, and use two decompressors.

[Timing diagram: cycles (t = 0, 10, 20, 30) until the critical instruction word arrives under each scheme, including the one-cycle decompression stages and the fetch of the remaining lines]
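A back-of-the-envelope model of these timelines, using the memory parameters from the baseline slide below (10-cycle latency, 2-cycle rate). The one-instruction-per-cycle decompressor rate is an assumption, and for simplicity the compressed fetch is charged as many bus beats as the raw instructions would need, which is pessimistic.

    LAT, RATE = 10, 2      # main-memory first-beat latency, per-beat rate
    INSNS_PER_BEAT = 2     # 64-bit bus / 32-bit instructions

    def native_critical_first():
        # Critical-word-first: the missed word rides the first beat.
        return LAT + RATE

    def codepack_sequential(n, index_cached=False):
        cycles = 0 if index_cached else LAT + RATE  # index-table access
        cycles += LAT                               # compressed-code fetch begins
        beats = -(-n // INSNS_PER_BEAT)             # beats to cover instruction n
        cycles += beats * RATE
        return cycles + n                           # ~1 decompressed insn/cycle

    print(native_critical_first())         # 12 cycles to the critical word
    print(codepack_sequential(5))          # 33 cycles: index + codes + decode
    print(codepack_sequential(5, True))    # 21 cycles with an index-cache hit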

Baseline results

CodePack causes up to an 18% performance loss.

Simulation setup:
- SimpleScalar, 4-issue, out-of-order
- 16 KB caches
- Main memory: 10-cycle latency, 2-cycle rate

[Chart: instructions per cycle (0.0-1.8) for native vs. CodePack on cc1, go, perl, and vortex]

Optimization A: Index cache

Remove the index-table access with a cache; a hit removes the main-memory access for the index.
- optimized: 64 lines, fully associative, 4 indices/line (<15% miss ratio); within 8% of native code
- perfect: an infinitely sized index cache; within 5% of native code

[Chart: speedup over native code (0.0-1.2) for CodePack, optimized, and perfect on cc1, go, perl, and vortex]
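A minimal model of the index cache in the optimized configuration above (64 lines, 4 indices per line, fully associative). LRU replacement is an assumption; the slide does not name a policy.

    from collections import OrderedDict

    class IndexCache:
        def __init__(self, lines=64, indices_per_line=4):
            self.lines = lines
            self.per_line = indices_per_line
            self.data = OrderedDict()   # line tags, kept in LRU order
            self.hits = self.misses = 0

        def lookup(self, block_num):
            tag = block_num // self.per_line  # 4 consecutive indices share a line
            if tag in self.data:
                self.data.move_to_end(tag)    # refresh LRU position
                self.hits += 1
                return True                   # served without a memory access
            self.misses += 1
            if len(self.data) >= self.lines:
                self.data.popitem(last=False) # evict least recently used line
            self.data[tag] = True
            return False

    cache = IndexCache()
    for block in [0, 1, 2, 3, 4, 0, 1]:       # blocks 0-3 share one line
        cache.lookup(block)
    print(cache.hits, cache.misses)           # 5 hits, 2 misses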

Optimization B: More decoders

Codeword tags enable fast extraction of codewords, which enables parallel decoding, so we try adding decoders for faster decompression.
- 2 decoders: performance within 13% of native code

[Chart: speedup over native code (0.0-1.0) for CodePack and for 2, 3, and 16 instructions decoded per cycle on cc1, go, perl, and vortex]
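The reason tags help: the first two or three bits of a codeword determine its total length, so a second decoder can locate the next codeword without finishing a dictionary lookup. A sketch, with tag-to-length pairs derived from the upper-half encoding table:

    # Total codeword length implied by each tag (tag + index or raw bits).
    TAG_LEN = {"00": 5, "01": 7, "100": 9, "101": 11, "110": 12, "111": 19}

    def codeword_length(bits):
        for tag, length in TAG_LEN.items():
            if bits.startswith(tag):
                return length
        raise ValueError("bad tag")

    def split_codewords(stream, n=2):
        """Peel off up to n codewords, as n parallel decoders would."""
        out, pos = [], 0
        for _ in range(n):
            if pos >= len(stream):
                break
            length = codeword_length(stream[pos:])
            out.append(stream[pos:pos + length])
            pos += length
        return out

    # Two codewords ('00' + 3-bit index, then an escape) split in one pass:
    print(split_codewords("00001" + "111" + "1011111011101111"))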

Comparison of optimizations

The index cache provides the largest benefit.

Optimizations applied together:
- index cache: 64 lines, 4 indices/line, fully associative
- 2nd decoder
- Speedup over native code: 0.97 to 1.05
- Speedup over CodePack: 1.17 to 1.25

[Chart: speedup over native code (0-1.2) for CodePack, index cache, 2nd decoder, and both optimizations on cc1, go, perl, and vortex]

Cache effects

Cache size controls the normal CodePack slowdown, and the optimizations do well with small caches: a 1.14 speedup at the smallest size.

[Chart: go benchmark; speedup over native code (0.0-1.4) for CodePack and optimized with 1 KB, 4 KB, 16 KB, and 64 KB caches]

Memory latency

Optimized CodePack performs better with slow memories because it makes fewer memory accesses than native code.

[Chart: go benchmark; speedup over native code (0.0-1.2) for CodePack and optimized at 0.5x, 1x, 2x, 4x, and 8x memory latency]

Memory width

CodePack provides a speedup for small buses, and the optimizations help performance degrade gracefully as the bus widens.

[Chart: go benchmark; speedup over native code (0.0-1.2) for CodePack and optimized with 16-, 32-, 64-, and 128-bit buses]

Conclusions

- CodePack works for instruction sets other than PowerPC.
- Performance can be improved at modest cost by removing decompression overhead: the index lookup and the dictionary lookup.
- Compression can speed up execution: compressed code requires fewer main-memory accesses, and CodePack includes simple prefetching.
- Systems with narrow buses and slow memories benefit most from compression.
- Workstations might also benefit: fewer L2 misses and less disk access.

Web page: http://www.eecs.umich.edu/~tnm/compress