SH-Mobile3: Application Processor for 3G Cellular Phones on a Low-Power SoC Design Platform

H. Mizuno, N. Irie, K. Uchiyama, Y. Yanagisawa (1), S. Yoshioka (1), I. Kawasaki (1), and T. Hattori (2)
Hitachi Ltd., Tokyo, Japan
(1) Renesas Technology Corp., Tokyo, Japan
(2) SuperH Japan Ltd., Tokyo, Japan

Outline
- Background
- Chip overview
- Active power reduction
  - High MIPS/MHz core
  - Java accelerator
- Standby power reduction
  - Low-power SoC design platform
  - Supply domains and two standby modes (resume and ultra standby modes)
- Summary

Background
- 3G cellular phone requirements:
  - High data throughput (144 kbps to 2 Mbps)
  - Advanced applications (Java, videophone, and 3D CG)
  - Long battery life (> 300 hours)
- Phone architecture: RF, baseband processor, application processor
- Advanced process technology: higher operating speed, a large amount of integration, and lower leakage power are conflicting requirements.

Chip overview
- 130-nm, dual-Vth, dual-tox CMOS (5Cu) technology
- Dedicated multiple computation engines:
  - SuperH core (SH-X), including DSP and Java (BTU) engines
  - MPEG-4
  - 3D graphics
- 256-kB on-chip RAM (URAM)
- Low-power SoC design platform:
  - µI/O (level shifter technology)
  - On-chip power switches (PSWs)
[Die photo, 7.6 mm x 7.7 mm. Labeled blocks: 3D Graphics Engine, Processor Core, PSW1, BTU, LCDC, MPEG-4, Video Interface, PSW2, URAM]
Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. or other countries.

Chip diagram
[Block diagram: SH-Mobile3 connected to the RF/baseband processor, LCD, SRAM/flash, SDRAM, and camera. On-chip blocks: SH-X core with DSP, I$ 32 kB, D$ 32 kB, BTU, and XY RAM; BBIF; LCDC; memory controller; VIF; 3DG; MPEG4; URAM 256 kB; PSC]
- BTU: Java byte-code translation unit
- 3DG: 3D graphics engine
- VIF: video interface
- URAM: user RAM
- PSC: power switch controller

Active power reduction
To achieve sufficient performance with minimum operating frequency and power consumption:
- High MIPS/MHz core
  - Optimized dual-issue 7-stage pipeline
- Dedicated multiple computation engines
  - Java accelerator
  - MPEG-4
  - 3D graphics

Pipeline structure
- Dual-issue 7-stage pipeline: higher MHz, but lower cycle performance
- An optimized pipeline using delayed execution enhances cycle performance.
[Pipeline diagram: stages I1, I2, ID, E1 to E7, with the delayed execution starting points marked. Stage labels: Instruction Fetch, Early Branch, Decode, Address, Execution, Data Load, Tag, Multiply, ALU, Data Store, WB, DSP]

Delayed execution (DE)
- DE accelerates multiple-cycle and dependent instruction flows.
- Example, a typical DSP instruction flow: load, arithmetic execution, store
    Load:     MOVX.W @R4,X0
    Multiply: PMULS  X0,A0
    Store:    MOVX.W A0,@R5
- Conventional architecture: 3-cycle stalls
- Delayed execution: no pipeline stall
[Timing diagram: stages E1 to E5 of the load, multiply, and store in each architecture]
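The stall comparison on the delayed-execution slide can be illustrated with a toy model of a dependent load, multiply, store chain. This is only a sketch: the function name and the per-instruction stage numbers are illustrative assumptions chosen so that the model reproduces the slide's stall counts (3 cycles versus none); they are not taken from the SH-X pipeline definition.

```python
def total_stalls(chain):
    """Count stall cycles for a chain of dependent instructions.

    chain is a list of (need_stage, ready_stage) pairs, one per instruction:
      need_stage  = E-stage at whose start the operand must be available
      ready_stage = E-stage at whose end the result becomes available
    Each instruction depends on the one before it and would ideally
    enter E1 exactly one cycle after its predecessor.
    """
    stalls = 0
    prev_start = None  # cycle in which the previous instruction entered E1
    prev_ready = None  # ready_stage of the previous instruction
    for need, ready in chain:
        if prev_start is None:
            start = 0
        else:
            ideal = prev_start + 1
            # The consumer's need stage may not begin before the cycle
            # after the producer's ready stage has completed.
            earliest = prev_start + prev_ready - need + 1
            start = max(ideal, earliest)
            stalls += start - ideal
        prev_start, prev_ready = start, ready
    return stalls


# Hypothetical stage numbers (assumptions, not SH-X data).
# Conventional: every instruction consumes its operand in E1.
conventional = [(1, 2), (1, 3), (1, 1)]   # load, multiply, store
# Delayed execution: multiply and store consume their operands later.
delayed = [(1, 2), (3, 4), (5, 5)]        # load, multiply, store

print(total_stalls(conventional), total_stalls(delayed))
```

The point of the model is that moving the operand-use stage later in the pipeline gives the producing instruction time to finish, so the consumer no longer has to wait at issue.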

Performance evaluation
Benchmark: Dhrystone 2.1
[Bar chart: MIPS/MHz as a ratio to SH-4 (axis 80% to 105%) for SH-X with all techniques (7-stage pipeline), SH-X without them (7-stage pipeline), and SH-4 (5-stage pipeline). Bar labels: 1.85 and 1.81; annotation: 20.9% improvement]

Operating power of processor core
Benchmark: Dhrystone 2.1
[Plot: power (mW, 0 to 200) versus frequency (180 to 260 MHz)]
- VDD = 1.2 V: 0.57 mW/MHz
- VDD = 1.0 V: 0.40 mW/MHz
- 1.8 MIPS/MHz / 0.40 mW/MHz = 4500 MIPS/W
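The power-efficiency figure on the operating-power slide follows from a unit conversion, sketched below. The function name is hypothetical, and applying the same 1.8 MIPS/MHz at VDD = 1.2 V is an assumption (cycle performance is taken to be independent of supply voltage).

```python
def mips_per_watt(mips_per_mhz, mw_per_mhz):
    """Convert MIPS/MHz and mW/MHz into MIPS/W."""
    watts_per_mhz = mw_per_mhz / 1000.0  # mW -> W
    return mips_per_mhz / watts_per_mhz


# Slide values: 1.8 MIPS/MHz, 0.40 mW/MHz at 1.0 V, 0.57 mW/MHz at 1.2 V.
at_1v0 = mips_per_watt(1.8, 0.40)
# Assumption: the same MIPS/MHz holds at 1.2 V.
at_1v2 = mips_per_watt(1.8, 0.57)

print(f"1.0 V: {at_1v0:.0f} MIPS/W, 1.2 V: {at_1v2:.0f} MIPS/W")
```

Under that assumption, lowering the supply from 1.2 V to 1.0 V improves efficiency by the ratio 0.57/0.40, roughly 1.4 times.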

Java accelerator (BTU)
[Block diagram: SH-X core, DSP, X-bus, Y-bus, XYRAM, cache-RAM bus, BTU, I-cache, ITLB, UTLB, D-cache, BIC, L2 cache, UBC, URAM, Inst-bus, Data-bus, internal bus]
- BTU: Java byte-code translation unit
- UBC: user break controller
- URAM: user RAM
- BIC: bus interface controller

BTU block diagram
[Block diagram of the BTU: byte-code fetch, instruction buffer, decoder, extended decoder, translation logic, ALU, register file, configuration register, and state, exception, and pipeline control. Signal labels: byte-code, native code, extended code, immediate, state info; numeric labels 16, 4, and 10. The BTU connects over the BTU-bus to the BIC, I-cache, and D-cache, and to the internal bus, Inst-bus, and Data-bus]

Parallel execution in BTU
- The BTU shares control information and data with the CPU.
- This enables parallel execution of data and control processing (e.g. Java exception detection).
[Comparison diagram of three accelerator styles, showing register file, ALU, pipeline status, exception detection, and cache:
 coprocessor type: data separated;
 conventional accelerator: data shared;
 BTU: control shared]

Java power evaluation
- Performance with BTU: 6.55 ECM/MHz (basic VM: 0.64 ECM/MHz)
- Power consumption is reduced by 6%, and power per ECM is reduced by 90%.
[Bar chart of relative power (0.00 to 1.20), basic VM (no optimization) versus with BTU: power per clock frequency, 6% less power; power per ECM performance, 90% less power per ECM]
Evaluation board: 216 MHz, CLDC 1.0.4
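The 90% figure on the Java power slide is consistent with the other two numbers given there, as the short check below shows. It is a sketch with hypothetical variable names; it assumes both configurations run at the same clock frequency, so the ECM/MHz ratio equals the speedup.

```python
# Slide values.
ecm_per_mhz_btu = 6.55    # with BTU
ecm_per_mhz_basic = 0.64  # basic VM, no optimization
relative_power = 0.94     # power with BTU relative to basic VM (6% less)

# Assumption: same clock frequency in both cases.
speedup = ecm_per_mhz_btu / ecm_per_mhz_basic

# Power spent per unit of ECM performance, relative to the basic VM.
relative_power_per_ecm = relative_power / speedup
reduction = 1.0 - relative_power_per_ecm

print(f"speedup: {speedup:.1f}x, power per ECM reduced by {reduction:.1%}")
```

Almost all of the saving per ECM comes from the roughly tenfold speedup rather than from the 6% drop in power.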

Standby power reduction
To achieve lower standby power with minimum speed overhead:
- Low-power SoC design platform
  - On-chip power switches (PSWs)
  - µI/O
  - Low-leakage data-retention RAM technology
- Two standby modes
  - Resume standby mode
  - Ultra standby mode

Low-power SoC design platform (PSWs)
- Thick-tox, high-Vth NMOS transistors are used for the on-chip power switches (PSWs).
- This minimizes the various leakage currents: subthreshold, gate tunneling, gate-induced drain leakage (GIDL), and junction leakage.
[Diagram: logic in domain x (on/off) sits on a local vss, which a power switch under PSWC control connects to vss. Transistor cross-section (source, gate, drain, body) showing the leakage currents: gate tunneling leakage, GIDL, subthreshold leakage, junction leakage]

Low-power SoC design platform (µI/O)
- µI/O has a level-shift function and provides optimal supply and voltage domains for the dedicated multiple computation engines.
- It also prevents invalid signal transmission and supports:
  - Internal vss1 and/or vss2 shutdown by on-chip power switches
  - External vdd1 and/or vdd2 shutdown by off-chip regulators
[Diagram: domain 1 logic (off, vdd1) and domain 2 logic (on, vdd2) joined by µI/O; each domain has a local vss switched to vss under PSWC control]

Low-leakage data-retention memory
Hierarchical on-chip power switches in the SRAM provide subdivided power-line control.
- In active mode:
  - Vssm, Vssa, Vssc = Vss
  - Vddw = Vdd (selected), ~0.4 V down (unselected)
  - Local Vss = Vss
- In retention mode:
  - Vssa, Vssc: Hi-Z
  - Vssm: ~0.4 V up
  - Vddw: ~0.4 V down
  - Local Vss = Vss
- In shut-down mode:
  - Local Vss: Hi-Z
[Circuit diagram labels: leakage control, word driver (WD), driver control, memory cell, sense amplifier; rails Vdd, Vddw, Vss, Vssm, Vssa, Vssc, local Vss]

Leakage current of the memory
256 kB, room temperature, VDD = 1.2 V
[Bar chart: leakage current (0 to 1000 µA), broken down into memory cell, word driver, and amplifier]
- Conventional: 920 µA
- Proposed (active): 700 µA (-25%)
- Proposed (retention): 50 µA (-95%)

Two low-power modes
- Ultra standby: low leakage (~10 µA)
- Resume standby: low leakage (~100 µA), quick recovery (< 3 ms)
[Diagram for each mode: core domain (1.2 V) containing PSW1 (core, IP, and peripherals), PSW2 (URAM), backup registers, and PSC; I/O domain (2.85 V) containing I/O control]
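The reductions annotated on the memory leakage chart can be recomputed from the three bar values. The sketch below uses a hypothetical helper name and takes the bar values as exact, which is an assumption, since they are read off a chart.

```python
def reduction(baseline_ua, value_ua):
    """Fractional reduction of value_ua relative to baseline_ua."""
    return 1.0 - value_ua / baseline_ua


conventional_ua = 920.0       # conventional memory
proposed_active_ua = 700.0    # proposed, active mode
proposed_retention_ua = 50.0  # proposed, retention mode

active_cut = reduction(conventional_ua, proposed_active_ua)
retention_cut = reduction(conventional_ua, proposed_retention_ua)

print(f"active: {active_cut:.1%}, retention: {retention_cut:.1%}")
```

With these inputs the result is about 24% and 95%, which matches the slide's annotations to within the rounding one would expect from chart labels.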

R-standby recovery operation
- Hardware operation:
  - Power switch control
  - Clock generation (PLL, D.PLL lock)
  - Data backup using backup latch
    - BAR (Boot Address Register) holds the restart address
    - Clock and interrupt settings needed just after wake-up
- Software operation:
  - URAM: data backup memory
    - Control registers
    - OS task table, etc.
[Diagram: backup latch holding BAR, etc. alongside the PSW1 domain; URAM between Vdd and Vss with PSW2]

Recovery time from R-standby
Total recovery time from R-standby mode is only 1.6 ms without D.PLL lock, or 2.8 ms with D.PLL lock (external clock = 32 kHz).
[Timeline, 0 to 3 ms, without and with D.PLL lock: PSW on, PLL lock-in, D.PLL lock-in, state transition, restore registers, restart tasks]
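The recovery timeline can be written down as an ordered list of steps with an optional D.PLL lock. This is a minimal sketch: the names are hypothetical, the step order is assumed to follow the order of the timeline labels, and only the two total times are taken from the slide (per-step durations are not given there).

```python
# Steps as labeled on the timeline; the order is an assumption.
RECOVERY_STEPS = [
    "PSW on",
    "PLL lock-in",
    "D.PLL lock-in",
    "state transition",
    "restore registers",
    "restart tasks",
]

# Total recovery time in microseconds, keyed by whether D.PLL lock is used.
TOTAL_RECOVERY_US = {False: 1600, True: 2800}


def recovery_plan(use_dpll):
    """Return (steps, total_us) for an R-standby wake-up."""
    steps = [s for s in RECOVERY_STEPS if use_dpll or s != "D.PLL lock-in"]
    return steps, TOTAL_RECOVERY_US[use_dpll]


steps_fast, total_fast = recovery_plan(use_dpll=False)
steps_full, total_full = recovery_plan(use_dpll=True)

# Extra time attributable to waiting for the D.PLL to lock.
dpll_cost_us = total_full - total_fast
print(dpll_cost_us)
```

Either way the total stays under the 3 ms quoted for resume standby; the difference between the two totals is what the D.PLL lock costs with a 32 kHz external clock.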

Standby power consumption
Room temperature, VDD = 1.2 V
[Bar chart: leakage current (µA)]
- Standby without power cutoff: 2.2 mA
- R-standby: 86 µA (-96%)
- U-standby: 11 µA (-99%)

Summary
- 130-nm, 5-layer-Cu, dual-Vth, dual-tox CMOS technology
- Dedicated multiple computation engines:
  - SuperH core (SH-X) including DSP and Java engines
  - MPEG-4
  - 3D graphics
- Power efficiency:
  - SH-X: 4500 MIPS/W
  - Java: 6.55 ECM/MHz
- Low-power SoC design platform:
  - On-chip power switches
  - µI/O
  - Low-leakage data-retention RAM
- Two standby modes (R-standby and U-standby):
  - Leakage current: 86 µA and 11 µA
  - Recovery time from R-standby: 1.6 ms or 2.8 ms (external clock = 32 kHz)
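The standby currents in the summary can be turned into approximate power figures with P = I x V. The sketch below assumes that all of the quoted current is drawn from the 1.2 V core supply, which the slides do not state explicitly; the names are hypothetical.

```python
VDD_CORE = 1.2  # volts, from the standby power slide

# Leakage currents in microamps, from the slide.
standby_ua = {
    "no power cutoff": 2200.0,
    "R-standby": 86.0,
    "U-standby": 11.0,
}

# Assumption: the whole current flows from the 1.2 V rail.
standby_uw = {mode: i_ua * VDD_CORE for mode, i_ua in standby_ua.items()}

for mode, p_uw in standby_uw.items():
    print(f"{mode}: {p_uw:.1f} uW")
```

Under this assumption the chip goes from about 2.6 mW without power cutoff to roughly 0.1 mW in R-standby and about 13 µW in U-standby.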