CSE 291: Mobile Application Processor Design

Similar documents
GreenDroid: A Mobile Application Processor for a Future of Dark Silicon

GreenDroid: An Architecture for the Dark Silicon Age

Conservation Cores: Reducing the Energy of Mature Computations

Conservation Cores: Reducing the Energy of Mature Computations

GREENDROID: AN ARCHITECTURE FOR DARK SILICON AGE. February 23, / 29

Computer Architecture

CSE 141: Computer Architecture. Professor: Michael Taylor. UCSD Department of Computer Science & Engineering

Multicore Hardware and Parallelism

Computer Architecture!

Parallelism in Hardware

CS671 Parallel Programming in the Many-Core Era

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

Outline Marquette University

Microprocessor Trends and Implications for the Future

Computer Architecture!

Computer Architecture

Computer Architecture!

Dark Silicon and its Implications for Future Processor Design

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Unit 2: Hardware Background

Instruction Fusion for Multiscalar and Many-Core Processors

Power Limitations and Dark Silicon Challenge the Future of Multicore

Lecture 1: Introduction and Basics

When Girls Design CPUs!

CS/EE 6810: Computer Architecture

CS516 Programming Languages and Compilers II

Implications of the Power Wall: Dim Cores and Reconfigurable Logic

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Fundamentals of Quantitative Design and Analysis

CSE : Introduction to Computer Architecture

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John

Advanced Computer Architecture Week 1: Introduction. ECE 154B Dmitri Strukov

Fundamentals of Computer Design

COMP 322: Fundamentals of Parallel Programming

CMPEN 411 VLSI Digital Circuits. Lecture 01: Introduction

Embedded Systems: Architecture

DVFS Space Exploration in Power-Constrained Processing-in-Memory Systems

Administrative matters. EEL-4713C Computer Architecture Lecture 1. Overview. What is this class about?

Chapter 1: Fundamentals of Quantitative Design and Analysis

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.

Lab. Course Goals. Topics. What is VLSI design? What is an integrated circuit? VLSI Design Cycle. VLSI Design Automation

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.

Memory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012

CS5222 Advanced Computer Architecture. Lecture 1 Introduction

CS61C Machine Structures. Lecture 1 Introduction. 8/27/2006 John Wawrzynek (Warzneck)

ARM Processor. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

45-year CPU evolution: one law and two equations

ECE 486/586. Computer Architecture. Lecture # 2

VLSI Design Automation

VLSI Design Automation

Low-power Architecture. By: Jonathan Herbst Scott Duntley

EE241 - Spring 2004 Advanced Digital Integrated Circuits

ECE484 VLSI Digital Circuits Fall Lecture 01: Introduction

ECE571: Advanced Microprocessor Design Final Project Spring Officially Due: Friday, 4 May 2018 (Last day of Classes)

Suggested Readings! Lecture 24" Parallel Processing on Multi-Core Chips! Technology Drive to Multi-core! ! Readings! ! H&P: Chapter 7! vs.! CSE 30321!

Computer Architecture. Fall Dongkun Shin, SKKU

EE 4980 Modern Electronic Systems. Processor Advanced

Advanced Computer Architecture (CS620)

Evaluating Orthogonality between Application Auto tuning and Run Time Resource Management for Adaptive OpenCL Applications

ECE 588/688 Advanced Computer Architecture II

Advanced Computer Architecture Week 1: Introduction. ECE 154B Dmitri Strukov

The Effect of Temperature on Amdahl Law in 3D Multicore Era

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Fundamentals of Computers Design

Transistors and Wires

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

An Introduction to Parallel Programming

Energy Efficiency and Resilience in Future ICs

EE241 - Spring 2000 Advanced Digital Integrated Circuits. Practical Information

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013

EE382 Processor Design. Class Objectives

Performance of computer systems

+ CS263: Runtime Systems

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Near-Threshold Computing: Reclaiming Moore s Law

ECE 571 Advanced Microprocessor-Based Design Lecture 24

45-year CPU Evolution: 1 Law -2 Equations

Architecture at the end of Moore

Moore s s Law, 40 years and Counting

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

ECE 471 Embedded Systems Lecture 2

ECE520 VLSI Design. Lecture 1: Introduction to VLSI Technology. Payman Zarkesh-Ha

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

How What When Why CSC3501 FALL07 CSC3501 FALL07. Louisiana State University 1- Introduction - 1. Louisiana State University 1- Introduction - 2

Gigascale Integration Design Challenges & Opportunities. Shekhar Borkar Circuit Research, Intel Labs October 24, 2004

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

CS 194 Parallel Programming. Why Program for Parallelism?

VLSI Design Automation. Maurizio Palesi

Lecture 1: Welcome, why are you here? James C. Hoe Department of ECE Carnegie Mellon University

Full Name: NetID: Midterm Summer 2017

Why Parallel Architecture

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

Concurrency & Parallelism, 10 mi

The Art of Parallel Processing

Lecture 2: Performance

Mobile Processors. Jose R. Ortiz Ubarri

Copyright 2012, Elsevier Inc. All rights reserved.

Transcription:

CSE 291: Mobile Application Processor Design

Mobile Application Processors are where the action are The evolution of mobile application processors mirrors that of microprocessors mirrors that of mainframes.. Not much left to borrow! Faster iterative cycle than desktops: obsolete vs. worn out Mobile Application Processors are the new centroid of microprocessor evolution! multicore out-of-order superscalar pipelining 486 Intel ARM 586 686 StrongARM Core Duo Cortex-A9 Cortex-A8 Cortex-A9 MPCore 1985 1990 1995 2000 2005 2010 2015

Desktop micros: frequency versus time 3800 20 yr / 10x (12%) 5 yr / 10x (58%) 7 yr / 10x (39%)

Power Density 1000 Watts/cm 2 100 10 1 1.5µ 1µ 0.7µ 0.5µ 0.35µ 0.25µ 0.18µ 0.13µ 0.1µ 0.07µ

Course Staff Prof. Michael Taylor CSE 3202 Office Hours: Right After Class Dr. Jack Sampson CSE 3204 Office Hours: TBD

Where to find information: Wherever you can get it! IEEE Explore website, free from UCSD network ISSCC (Chip Tapeout Papers) ACM Portal website, free from UCSD network google.com I m serious! Powerpoint presentations by Qualcomm, Nvidia, Etc. Commentary articles Clever search terms Reverse engineering sites (chipworks) Microprocessor Report leaked articles anandtech.com Technical fanboy site fandroid.com etc I m serious! qualcomm, ti, nvidia, google, arm developer sites and blogs

Project Structure First Two Weeks: Literature Review Related Work Project Proposal Last 6 Weeks Project Midpoint Review Final Project Report 6 page DAC submission DAC submission deadline shortly after class ends Final Project Presentation Last Day of Class

NVidia Tegra 2: Example MAP

Discussion RISC vs CISC Battle, Round II Vertical Integration (Apple) versus Competitive Bazaar (Wintel/ AndroidQualTISamidia) AMD vs. Intel Nvidia vs. Intel Nvidia vs. AMD/ATI Mobile Monopoly Theory Qualcomm (with AMD Adreno) Nvidia (with Icera) TI (historical, with OMAP) Apple Soon enough Mobile Snafus Intel sells Xscale to Marvl AMD sold Adreno Mobile GPU ($65m) to Qcom

GreenDroid: An Architecture for the Dark Silicon Era UCSD Center for Dark Silicon Department of Computer Science and Engineering, University of California, San Diego

This Talk The Dark Silicon Problem How to use Dark Silicon to improve energy efficiency (Conservation Cores) The GreenDroid Mobile Application Processor GreenDroid Highlights

Where does dark silicon come from? And how dark is it going to be? The Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. [Venkatesh, Chakraborty]

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio

Scaling 101: Moore s Law 90 65 45 32 22 16 11 8 nm S = 22 16 = ~1.4x

Scaling 101: Transistors scale as S2 180 nm 16 cores MIT Raw S = 2x Transistors = 4x 90 nm 64 cores Tilera TILE64

Advanced Scaling: Dennard: Computing Capabilities Scale by S 3 = 2.8x If S=1.4x S 3 S 2 S 1 Design of Ion-Implanted MOSFETs with Very Small Dimensions Dennard et al, 1974

Advanced Scaling: Dennard: Computing Capabilities Scale by S 3 = 2.8x If S=1.4x S 3 S 2 = 2x More Transistors S 2 S 1

Advanced Scaling: Dennard: Computing Capabilities Scale by S 3 = 2.8x If S=1.4x S = 1.4x Faster Transistors S 2 = 2x More Transistors S 3 S 2 S 1

Advanced Scaling: Dennard: Computing Capabilities Scale by S 3 = 2.8x If S=1.4x S = 1.4x Faster Transistors S 2 = 2x More Transistors But wait: switching 2.8x times as many transistors per unit time what about power?? S 3 S 2 S 1

Dennard: We can keep power consumption constant S = 1.4x Faster Transistors S 2 = 2x More Transistors S = 1.4x Lower Capacitance S 3 S 2 S 1

Dennard: We can keep power consumption constant S = 1.4x Faster Transistors S 2 = 2x More Transistors S = 1.4x Lower Capacitance Scale Vdd by S=1.4x S 2 = 2x S 3 S 2 S 1

Dennard: We can keep power consumption constant S = 1.4x Faster Transistors S 2 = 2x More Transistors S = 1.4x Lower Capacitance Scale Vdd by S=1.4x S 2 = 2x S 3 S 2 S 1

Fast forward to 2005: Threshold Scaling Problems due to Leakage Prevents Us From Scaling Voltage S = 1.4x Faster Transistors S 2 = 2x More Transistors S = 1.4x Lower Capacitance Scale Vdd by S=1.4x S 2 = 2x S 3 S 2 S 1

Fast forward to 2005: Threshold Scaling Problems due to Leakage Prevents Us From Scaling Voltage S = 1.4x Faster Transistors S 2 = 2x More Transistors S = 1.4x Lower Capacitance Scale Vdd by S=1.4x S 2 = 2x S 3 S 2 S 1

Full Chip, Full Frequency Power Dissipation Is increasing exponentially by 2x with every process generation S 3 S 2 Factor of S 2 = 2X shortage!! S 1 [ASPLOS 2010, Venkatesh]

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio Classical scaling Device count S 2 Device frequency S Device cap (power) 1/S Device V dd (power) 1/S 2 Utilization? Leakage-limited scaling Device count S 2 Device frequency S Device cap->power 1/S Device V dd (power) ~1 Utilization? 1 S 2

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio Classical scaling Device count S 2 Device frequency S Device cap (power) 1/S Device V dd (power) 1/S 2 Utilization? 1 Leakage-limited scaling Device count S 2 Device frequency S Device cap->power 1/S Device V dd (power) ~1 Utilization? 1/S 2 1 S 2

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio 2x 2x 2x

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio 2.8x 2x

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio 2.8x 2x

We've Hit The Utilization Wall Utilization Wall: With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. Scaling theory Transistor and power budgets are no longer balanced The utilization wall will change the way everyone builds processors. Exponentially increasing problem! Experimental results Replicated a small datapath More "dark silicon" than active Observations in the wild Flat frequency curve "Turbo Mode" Increasing cache/processor ratio 2.8x 2x

Utilization Wall: Dark Silicon Leads to Dark Implications for Multicore Spectrum of tradeoffs between # of cores and frequency Example: 65 nm 32 nm (S = 2) 4 cores @ 1.8 GHz... 2x4 cores @ 1.8 GHz (8 cores dark, 8 dim) (Industry s Choice, next slide) 4 cores @ 2x1.8 GHz (12 cores dark) 65 nm 32 nm [Hotchips 2010]

Utilization Wall in Real Life 45 nm 32 nm (S=1.4x) 45 nm Nehalem 3.2 GHz 4 cores 120 W S=1.4x Δ cores = 1.5x Δ freq = 1.03x 32 nm Gulftown 3.3 GHz 6 cores 120 W As predicted by the utilization wall, compute capability scaled as S and not as S3 Improvements in performance are gated by improvements in energy efficiency, which primarily result from decreases in capacitance, not improvements in frequency or area.