Conservation Cores: Reducing the Energy of Mature Computations

Similar documents
Conservation Cores: Reducing the Energy of Mature Computations

GreenDroid: A Mobile Application Processor for a Future of Dark Silicon

GreenDroid: An Architecture for the Dark Silicon Age

CSE 291: Mobile Application Processor Design

UNIVERSITY OF CALIFORNIA, SAN DIEGO. Design and Architecture of Automatically-generated Energy-reducing Coprocessors

UNIVERSITY OF CALIFORNIA, SAN DIEGO. Configurable Energy-efficient Co-processors to Scale the Utilization Wall

GREENDROID: AN ARCHITECTURE FOR DARK SILICON AGE. February 23, / 29

Dark Silicon and its Implications for Future Processor Design

UNIVERSITY OF CALIFORNIA, SAN DIEGO. ASIC life extension through hardware patch interfaces

Memory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012

QSCORES: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores

Lab. Course Goals. Topics. What is VLSI design? What is an integrated circuit? VLSI Design Cycle. VLSI Design Automation

A General-Purpose Architectural Approach to Energy Efficiency for Greendroid Mobile Application Processor

FABRICATION TECHNOLOGIES

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Lecture 8: Synthesis, Implementation Constraints and High-Level Planning

Exploring Energy Scalability in Coprocessor-Dominated Architectures for Dark Silicon

Multicore Hardware and Parallelism

Response Time and Throughput

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

SYNTHESIS FOR ADVANCED NODES

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

The Computer Revolution. Classes of Computers. Chapter 1

Performance, Power, Die Yield. CS301 Prof Szajda

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance

Introduction to CMOS VLSI Design (E158) Lecture 7: Synthesis and Floorplanning

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS

Parallelism in Hardware

Chapter 5: ASICs Vs. PLDs

Ultra Low-Cost Defect Protection for Microprocessor Pipelines

Transistors and Wires

Design Solutions in Foundry Environment. by Michael Rubin Agilent Technologies

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Functioning Hardware from Functional Specifications

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.

Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures

COE 561 Digital System Design & Synthesis Introduction

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Introduction to GPU computing

Design Methodologies. Kai Huang

101-1 Under-Graduate Project Digital IC Design Flow

An Energy-Efficient Patchable Accelerator For Post-Silicon Engineering Changes

TDT4255 Computer Design. Lecture 1. Magnus Jahre

Lecture 2: Performance

Outline Marquette University

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Research Challenges for FPGAs

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

ASIC, Customer-Owned Tooling, and Processor Design

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

Exploring Logic Block Granularity for Regular Fabrics

POSH: A TLS Compiler that Exploits Program Structure

Adaptive Voltage Scaling (AVS) Alex Vainberg October 13, 2010

Chapter 1. Computer Abstractions and Technology

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

An Introduction to Parallel Architectures

Meet the Walkers! Accelerating Index Traversals for In-Memory Databases"

High Quality, Low Cost Test

ECE 154A. Architecture. Dmitri Strukov

Rechnerstrukturen

Kismet: Parallel Speedup Estimates for Serial Programs

Chapter 1. The Computer Revolution

Understanding the tradeoffs and Tuning the methodology

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

CS 2630 Computer Organization. What did we accomplish in 8 weeks? Brandon Myers University of Iowa

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

Concurrency & Parallelism, 10 mi

CS 2630 Computer Organization. What did we accomplish in 15 weeks? Brandon Myers University of Iowa

An overview of standard cell based digital VLSI design

More Course Information

Towards Optimal Custom Instruction Processors

Architectural Musings

ASYNC Rik van de Wiel COO Handshake Solutions

TDT 4260 lecture 3 spring semester 2015

Emerging Platforms, Emerging Technologies, and the Need for Crosscutting Tools Luca Carloni

Stochastic Processors (or processors that do not always compute correctly by design)

Physical Synthesis and Electrical Characterization of the IP-Core of an IEEE-754 Compliant Single Precision Floating Point Unit

The Beauty and Joy of Computing

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

Dr. Ajoy Bose. SoC Realization Building a Bridge to New Markets and Renewed Growth. Chairman, President & CEO Atrenta Inc.

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

MOS High Performance Arithmetic

Overview of Digital Design Methodologies

Xilinx SSI Technology Concept to Silicon Development Overview

EE-382M VLSI II. Early Design Planning: Front End

Introduction. CS 2210 Compiler Design Wonsun Ahn

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

Chip-Multithreading Systems Need A New Operating Systems Scheduler

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

Spiral 2-8. Cell Layout

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

Architecture at the end of Moore

PACE: Power-Aware Computing Engines

Transcription:

Conservation Cores: Reducing the Energy of Mature Computations Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, Michael Bedford Taylor Department of Computer Science and Engineering, University of California, San Diego 1

The Utilization Wall Scaling theory Transistor and power budgets no longer balanced Exponentially increasing problem! Experimental results Replicated small datapath More Dark Silicon than active Observations in the wild Flat frequency curve Turbo Mode Increasing cache/processor ratio Classical scaling Device count Device frequency Device power (cap) Device power (Vdd) Utilization S2 S 1/S 1/S2 1 Leakage limited scaling Device count Device frequency Device power (cap) Device power (Vdd) Utilization S2 S 1/S ~1 1/S2 2

The Utilization Wall Scaling theory Transistor and power budgets no longer balanced Exponentially increasing problem! 1.0 Expected utilization for fixed area and power budget 0.9 2x 0.8 0.7 Experimental results Replicated small datapath More Dark Silicon than active Observations in the wild Flat frequency curve Turbo Mode Increasing cache/processor ratio 0.6 0.5 2x 0.4 0.3 2x 0.2 0.1 0.0 90nm 65nm 45nm 32nm 3

The Utilization Wall Scaling theory Transistor and power budgets no longer balanced Exponentially increasing problem! Experimental results Replicated small datapath More Dark Silicon than active Observations in the wild Flat frequency curve Turbo Mode Increasing cache/processor ratio Utilization @ 300mm 2 & 80w 0.20 0.18 17.6% 0.16 0.14 0.12 3x 0.10 0.08 6.5% 0.06 2x 3.3% 0.04 0.02 0.00 90nm 45nm 32nm TSMC TSMC ITRS 4

The Utilization Wall Scaling theory Transistor and power budgets no longer balanced Exponentially increasing problem! Experimental results Replicated small datapath More Dark Silicon than active Observations in the wild Flat frequency curve Turbo Mode Increasing cache/processor ratio Utilization @ 300mm 2 & 80w 0.20 0.18 17.6% 0.16 0.14 0.12 3x 0.10 0.08 6.5% 0.06 2x 3.3% 0.04 0.02 0.00 90nm 45nm 32nm TSMC TSMC ITRS 5

The Utilization Wall Scaling theory Transistor and power budgets no longer balanced Exponentially increasing problem! Experimental results Replicated small datapath More Dark Silicon than active Observations in the wild Flat frequency curve Turbo Mode Increasing cache/processor ratio We re already here Utilization @ 300mm 2 & 80w 0.20 0.18 17.6% 0.16 0.14 0.12 3x 0.10 0.08 6.5% 0.06 2x 3.3% 0.04 0.02 0.00 90nm 45nm 32nm TSMC TSMC ITRS 6

Spectrum of tradeoffs between # cores and frequency.. Utilization Wall: Dark Implications for Multicore 2x4 cores @ 3 GHz (8 cores dark) (Industry s Choice). e.g.; take 65 nm 32 nm; i.e. (s =2). 4 cores @ 3 GHz 4 cores @ 2x3 GHz (12 cores dark) 7 65 nm 32 nm

What do we do with Dark Silicon? Dark Silicon Insights: Power is now more expensive than area Specialized logic has been shown as an effective way to improve energy efficiency (10-1000x) Our Approach: Fill dark silicon with specialized cores to save energy on common apps Power savings can be applied to other program, increasing throughput C-cores provide an architectural way to trade area for an effective increase in power budget! 8

Conservation Cores Specialized cores for reducing energy Automatically generated from hot regions of program source Patching support future proofs HW D cache C-Core Fully automated toolchain Drop-in replacements for code Hot code implemented by C-Core, cold code runs on host CPU HW generation/sw integration Hot code Host CPU I cache (general purpose) Energy efficient Up to 16x for targeted hot code Cold code 9

The C-Core life cycle 10

Outline The Utilization Wall Conservation Core Architecture & Synthesis Patchable Hardware Results Conclusions 11

Constructing a C-Core C-Cores start with source code Parallelism agnostic C code supported Arbitrary memory access patterns Complex control flow Same cache memory model as processor Function call interface 12

Constructing a C-Core Compilation C-Core isolation SSA, infinite register, 3-address Direct mapping from CFG, DFG Scan chain insertion Verilog to Place & Route TSMC 45nm libraries Synopsys CAD flow Synthesis Placement Clock Tree Generation Routing 13

C-Core for sumarray Gold Control path Blue Registers Green Post-route Std. Cell layout of an actual C-Core generated by our toolchain Data path 0.01 mm2, 1.4 GHz 14

A C-Core enhanced system Tiled multiprocessor environment Homogeneous interfaces, heterogeneous resources Several C-Cores per tile Different types of C-cores on different tiles Each C-Core interfaces with 8-stage MIPS core Scan chains, cache as interfaces 15

Outline The Utilization Wall Conservation Core Architecture & Synthesis Patchable Hardware Results Conclusions 16

Patchable Hardware Future versions of hot code regions may have changes Need to keep HW usable C-Cores unaffected by changes to cold regions General exception mechanism Trap to SW Can support any changes 17

Reducing the cost of change Examined versions of applications as they evolved Many changes are straightforward to support Simple lightweight configurability Preserve structure Support only those changes commonly seen Structure Replaced by adder subtractor AddSub comparator(ge) Compare6 bitwise AND, OR, XOR BitwiseALU constant value 32-bit register 18

Patchability overheads Area overhead Split between generalized datapath elements and constant registers Power overhead 10-15% for generalized datapath elements Opportunity costs Reduced partial evaluation Can be large for multipliers, shifters 19

Patchability payoff: Longevity Graceful degradation Lower initial efficiency Much longer useful lifetime Increased viability With patching, utility lasts ~10 years for 4 out of 5 applications Decreases risks of specialization 20

Outline The Utilization Wall Conservation Core Architecture & Synthesis Patchable Hardware Results Conclusions 21

Automated measurement Source methodology C-Core toolchain Specification generator Verilog generator Synopsys CAD flow Design Compiler IC Compiler TSMC 45nm Simulation Hotspot analyzer Cold code Hot Code C-Core Rewriter specification generator Verilog generator gcc Validated cycle-accurate C-Core modules Post-route netlist simulation Simulation Synopsys flow Power measurement VCS+PrimeTime Power measurement 22

Our cadre of C-Cores We built 23 C-Cores for assorted versions of 5 applications Both patchable and nonpatchable versions of each Varied in size from 0.015 to 0.326 mm2 Frequencies from 0.9 to 1.9GHz 23

Per-function efficiency (work/j) C-Core hot-code energy efficiency 16 Software C-Core C-Core (code changed) 14 12 10 8 6 4 2 0 djpeg djpeg A B mcf A mcf B vpr A vpr B cjpeg cjpeg bzip2 Avg. A B A-F Up to 16x as efficient as general purpose in-order core, 9.5x on average 24

System energy efficiency C-Cores very efficient for targeted hot code Amdahl s Law limits total system efficiency 25

Normalized application EDP C-Core system efficiency with current toolchain 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Base Software djpeg djpeg A B Patchable mcf A mcf B vpr A +coverage vpr B +lowleak cjpeg cjpeg bzip2 Avg. A B A-F Avg 33% EDP improvement 26

Tuning system efficiency Improving our toolchain s coverage of hot code regions Good news: Small numbers of static instructions account for most of execution System rebalancing for coldcode execution Improve performance/leakage trade-offs for host core 27

Normalized application EDP C-Core system efficiency with toolchain improvements 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Software djpeg djpeg A B Patchable mcf A mcf B vpr A +coverage vpr B +lowleak cjpeg cjpeg bzip2 Avg. A B A-F With + low leakage system components Withcoverage improved coverage Avg savings Avg61% 53%EDP EDP improvement Avg 14% increased execution time 28

Conclusions The Utilization Wall will change how we build hardware Hardware specialization increasingly promising Conservation Cores are a promising way to attack the Utilization Wall Automatically generated patchable hardware For hot code regions: 3.4 16x energy efficiency With tuning: 61% application EDP savings across system 45nm tiled C-Core prototype under development @ UCSD Patchability allows C-Cores to last for ten years Lasts the expected lifetime of a typical chip 29

30