PACE: Power-Aware Computing Engines

Similar documents
HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Keywords and Review Questions

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Memory Systems IRAM. Principle of IRAM

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

EECS 322 Computer Architecture Superpipline and the Cache

Processor (IV) - advanced ILP. Hwansoo Han

Multithreading and the Tera MTA. Multithreading for Latency Tolerance

Computer Architecture s Changing Definition

FABRICATION TECHNOLOGIES

Creating a Scalable Microprocessor:

Pipelining to Superscalar

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University


Processors. Young W. Lim. May 12, 2016

EC 513 Computer Architecture

Power Estimation of UVA CS754 CMP Architecture

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

Embedded Systems: Hardware Components (part I) Todor Stefanov

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

The Processor: Instruction-Level Parallelism

Lecture 1: Introduction

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Itanium 2 Processor Microarchitecture Overview

Introduction to Microprocessor

The T0 Vector Microprocessor. Talk Outline

Low-power Architecture. By: Jonathan Herbst Scott Duntley

CN310 Microprocessor Systems Design

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

VLIW Digital Signal Processor. Michael Chang. Alison Chen. Candace Hobson. Bill Hodges

Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA)

Strober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

Jim Keller. Digital Equipment Corp. Hudson MA

Advanced Computer Architecture

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

SYLLABUS. osmania university CHAPTER - 1 : REGISTER TRANSFER LANGUAGE AND MICRO OPERATION CHAPTER - 2 : BASIC COMPUTER

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Chapter 4 The Processor (Part 4)

ECE 5745 Complex Digital ASIC Design Course Overview

The Pentium II/III Processor Compiler on a Chip

Chapter 13 Reduced Instruction Set Computers

CS 152, Spring 2011 Section 8

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

General Purpose Signal Processors

Dynamic Scheduling. CSE471 Susan Eggers 1

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Write only as much as necessary. Be brief!

The ARM10 Family of Advanced Microprocessor Cores

The Vector-Thread Architecture Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanovic

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Advanced Instruction-Level Parallelism

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

A 1-GHz Configurable Processor Core MeP-h1

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

Value Compression for Efficient Computation

MIT Laboratory for Computer Science

ECE 2300 Digital Logic & Computer Organization. Caches

Computer Systems Organization

An Overview of Standard Cell Based Digital VLSI Design

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

VLSI Design Automation. Maurizio Palesi

Application of Power-Management Techniques for Low Power Processor Design

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

IMAGINE: Signal and Image Processing Using Streams

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators"

asoc: : A Scalable On-Chip Communication Architecture

CS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers

LECTURE 10: Improving Memory Access: Direct and Spatial caches

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

VLIW/EPIC: Statically Scheduled ILP

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

ARM Processors for Embedded Applications

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim

Lecture 7 - Memory Hierarchy-II

CS146 Computer Architecture. Fall Midterm Exam

MIT Laboratory for Computer Science

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

Marching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J.

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Embedded Systems. 7. System Components

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

High performance, power-efficient DSPs based on the TI C64x

Constructing Virtual Architectures on Tiled Processors. David Wentzlaff Anant Agarwal MIT

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy

Transcription:

PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/

PACE Approach Energy- Conscious Compilers Rethink Hardware-Software Interface for Power-Aware Computing Energy- Exposed Architectures

Conventional Architectures only Expose Performance Current RISC/VLIW ISAs only expose hardware features that affect critical path through computation

Energy Consumption is Hidden Most energy is consumed in microarchitectural operations that are from software!

Energy-Exposed Instruction Sets Reward compile-time knowledge with run-time energy savings hardware provides mechanisms to disable microarchitectural activity, a compile-time analysis determines which pieces of microarchitecture can be disabled for given application Co-develop energy-exposed architectures and energy-conscious compilers

Energy Management Layers Application Algorithm Source Code Compiler Run-Time/O.S. Instruction Set PACE Focus Areas Microarchitecture Circuit Design Fabrication Technology

SCALE Strawman Processor /O 32 processing tiles Fast on-chip data network 128x32b FLOP/cycle total 4096x8b OP/cycle total 128MB on-chip DRAM/16MB SRAM External DRAM interface Chip-to-chip interconnect channels 20x20mm 2 in 0.1µm CMOS Tile Bulk SRAM/ Embedded DRAM Data Unit Addr. Unit Cntl. Unit Off-chip DRAM SRAM/cache Data Net

SCALE Processor Tile Details Data Unit Address Unit Control Unit FP Multiplier AALU0 CALU DALU0 DReg0 64x64b ARegs C Regs DALU1 16x32b 16x32b VLIW DReg1 64x64b and AALU1 Inst. Fetch DALU2 Config. Memory &Decode DReg2 64x64b Cache Management Inst. Buffer DALU3 Tag PC DReg3 64x64b Store B Regs 8x32b FP Adder BALU Address/Data Interconnect Data Net 32KB SRAM (16 banks x 256 words x 64 bits)

SCALE Supports All Forms of Parallelism most streaming applications highly vectorizable vectors reduce instruction fetch/decode energy up to 20-60x (depends on vector length) mature programming and compilation model SCALE supports vectors in hardware address and data units optimized for vectors hardware vector control logic Vector Instructions Cntl. Unit Vector Control Addr. Unit Data Unit exploit instruction-level parallelism for nonvectorizable applications superscalar ILP expensive in hardware SCALE supports VLIW-style ILP reuse address and data unit datapath resources expose datapath control lines single wide instruction = configuration provide control/configuration cache distributed along datapaths run separate threads on different tiles any mix of vector or VLIW across tiles VLIW Program Counter Cntl. Unit VLIW Cache Addr. Unit Data Unit Thread 1Thread 2 Thread 3Thread 4

SCALE Exposes Locality at Multiple Levels 2D Tile and DRAM layout software maps computation to minimize network hops Local SRAM within tile software split between instruction/data/unified storage software scratchpad RAMs or hardware-managed caches Distributed cached control state within tile control unit: instruction buffer data/address unit: vector instructions or VLIW/configuration cache Distributed register file and ALU clusters within tile Control Unit: scalar (C) registers versus branch (B) registers Address Unit: address (A) registers Data Unit: Four clusters of data registers (D0-D4) Accumulators and sneak paths to bypass register files

SCALE Software Power Grid Turn off unused register banks and ALUs Reduce datapath width set width separately for each unit in tile (e.g., 32b in control unit, 16b in address unit, 64b in data unit) Turn off individual local memory banks Configure memory addressing model From hardware cache-coherence to local scratchpad RAM Turn off idle tiles and idle inter-tile network segments Turn off refresh to unused DRAM banks

Existing Infrastructure RAW Compiler Technology SUIF-based C/FORTRAN compiler for tiled arrays SPAN pointer analysis Bitwise bitwidth analysis Superword Level Parallelism Space/Time scheduling MAPS compiler-managed memory system Pekoe Low-Power Microprocessor Library Cells Full-custom processor blocks in 0.25µm CMOS process Designed for voltage-scaled operation SyCHOSys Energy-Performance Simulator Fast, multi-level compiled simulation Energy models for Pekoe processor blocks

Bitwidth Analysis Compile-time detection of minimum bitwidth required for each variable at every static location in the program A collection of techniques Arithmetic operations Boolean operations Bitmask operations Loop induction variable bounding Clamping optimization Type promotion Back propagation Array index optimization Value-range propagation using data-flow analysis Loop analysis Incorporated pointer alias analysis Paper in PLDI 00

Average Dynamic Power (mw) 5 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 Bitwidth Power Savings (C ASIC Synthesis) Methodology C RTL RTL simulation gives switching Synthesis tool reports dynamic power IBM SA27E process, 0.15µm drawn, 200 MHz Base case Bitwidth analysis b bblesort histogram jacobi pmatch

SyCHOSys Energy-Performance Simulation SyCHOSys compiles a custom cycle simulator from a structural machine description Supports gate level to behavioral level, or any mixture Behavior specified in C++, compiles to C++ object Can selectively compile in transition counting on nets Automatically factors out common counts for faster simulation Arbitrary energy models for functional units/memories Capacitances extracted from circuit layout or estimated Use fast bit-parallel structural energy models (much faster than lookups) Paper in Complexity-Effective Workshop, ISCA 00

SyCHOSys Evaluation GCD circuit benchmark full-custom datapath layout (0.25µm TSMC CMOS process) mixture of static and precharged blocks C-Behavioral (gcc) Verilog-Behavioral (VCS) Verilog-Structural (VCS) SyCHOSys-Structural SyCHOSys-Power PowerMill (extracted layout) Star-Hspice (extracted layout) Simulation Speed (Hz) 109,000,000.00 544,000.00 341,000.00 8,000,000.00 195,000.00 0.73 0.01 Error in power prediction N/A N/A N/A N/A 0.5% - 8.2% 7.2% - 13.7% 0%(reference)

SyCHOSys Processor Model Five-stage pipelined MIPS RISC processor+caches User/kernel mode, precise interrupts, validated with architectural test suite+random test programs Runs SPECint95 benchmarks Simulation speeds (Sun Ultra-5, 333MHz workstation) (ISA-level interpreter 3 MHz) Behavioral RTL 400kHz Structural model 40kHz Energy model 16kHz A Gigacycle/CPU-day or Megacycle/CPU-minute with better accuracy than Powermill

PACE Milestones Year 2000: Baseline design Baseline SCALE architecture definition RAW compiler generating code for baseline SCALE design Baseline SCALE architecture energy-performance simulator Year 2001: Single tile Energy-exposed SCALE tile architecture definition Energy-conscious compiler passes for SCALE tile Energy-exposed SCALE tile energy-performance simulator Evaluation of energy-exposed SCALE tile Year 2002: Multi-tile Energy-exposed SCALE multi-tile architecture definition Multi-tile energy-performance simulator Multi-tile energy-conscious compiler passes Evaluation of multi-tile SCALE processor (Options: Fabricate SCALE prototype)