SH-Mobile3: Application Processor for 3G Cellular Phones on a Low-Power SoC Design Platform

H. Mizuno, N. Irie, K. Uchiyama, Y. Yanagisawa (1), S. Yoshioka (1), I. Kawasaki (1), and T. Hattori (2)
Hitachi Ltd., Tokyo, Japan
(1) Renesas Technology Corp., Tokyo, Japan
(2) SuperH Japan Ltd., Tokyo, Japan

Outline
  - Background
  - Chip overview
  - Active power reduction
      - High MIPS/MHz core
      - Java accelerator
  - Standby power reduction
      - Low-power SoC design platform
      - Supply domains and two standby modes (resume and ultra standby modes)
  - Summary

Background
  - 3G cellular phone
      - High data throughput (144 kbps to 2 Mbps)
      - Advanced applications (Java, videophone, and 3D CG)
      - Long battery life (> 300 hours)
  - Advanced process technology
      - Higher operating speed, larger-scale integration, and lower leakage power are conflicting requirements.
  [Block diagram: RF, baseband processor, application processor]

Chip overview
  - 130-nm, dual-Vth, dual-tox CMOS (5Cu) technology
  - Dedicated multiple computation engines:
      - SuperH core (SH-X), including DSP and Java(TM) (BTU) engines
      - MPEG-4
      - 3D graphics
  - 256-kB on-chip RAM (URAM)
  - Low-power SoC design platform
      - µI/O (level shifter technology)
      - On-chip power switches (PSWs)
  [Chip micrograph, 7.6 mm x 7.7 mm: 3D graphics engine, processor core, PSW1, BTU, LCDC, MPEG-4, video interface, PSW2, URAM]
  Java(TM) and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. or other countries.

Chip diagram
  [Block diagram of SH-Mobile3: SH-X core (I$ 32 kB, D$ 32 kB, DSP, XY RAM, BTU), BBIF, LCDC, VIF, 3DG, MPEG4, URAM 256 kB, memory controller, PSC. External blocks: RF, baseband processor, LCD, SRAM/flash, SDRAM, camera.]
  - BTU: Java byte-code translation unit
  - 3DG: 3D graphics engine
  - VIF: Video interface
  - URAM: User RAM
  - PSC: Power switch controller

Active power reduction
  To achieve sufficient performance with minimum operating frequency and power consumption:
  - High MIPS/MHz core
      - Optimized dual-issue 7-stage pipeline
  - Dedicated multiple computation engines
      - Java accelerator
      - MPEG-4
      - 3D graphics

Pipeline structure
  - Dual-issue 7-stage pipeline
      - Higher MHz, but lower cycle performance
  - An optimized pipeline using delayed execution enhances cycle performance.
  [Pipeline diagram: stages I1, I2, ID, E1 to E7, with the delayed execution starting points marked. Stage functions shown: instruction fetch, early branch, decode, address, execution, data load, tag, multiply, ALU, data store, DSP, WB.]

Delayed execution (DE)
  - DE accelerates multiple-cycle and dependent instruction flows.
  - Example: a typical DSP instruction flow of load, arithmetic execution, and store
  - Conventional architecture: 3-cycle stalls
  - Delayed execution: no pipeline stall
      Load:      MOVX.W @R4,X0    E1 E2 E3 E4 E5
      Multiply:  PMULS X0,A0      E1 E2 E3 E4 E5
      Store:     MOVX.W A0,@R5    E1 E2 E3 E4 E5
  [Timing diagrams: stages E1 to E5 of the load, multiply, and store instructions for the conventional architecture and for delayed execution]

Performance evaluation
  [Bar chart: MIPS/MHz as a ratio to SH-4 (axis 80% to 105%), comparing SH-X with all techniques (7-stage pipeline), SH-X without the techniques (7-stage pipeline), and SH-4 (5-stage pipeline). Labeled values: 1.85 and 1.81 MIPS/MHz; 20.9% improvement.]
  - Benchmark: Dhrystone 2.1

Operating power of processor core
  [Plot: power (mW, 0 to 200) versus frequency (MHz, 180 to 260)]
  - VDD = 1.2 V: 0.57 mW/MHz
  - VDD = 1.0 V: 0.40 mW/MHz
  - 1.8 MIPS/MHz / 0.40 mW/MHz = 4500 MIPS/W
  - Benchmark: Dhrystone 2.1
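
The MIPS/W figure on the operating power slide follows from a unit conversion, which the short Python sketch below reproduces. The function names are illustrative and not from the slides. The second function additionally assumes that power is dominated by switching power and so scales with the square of the supply voltage; the slides do not state this, so treat it only as a plausibility check on the two mW/MHz values.

```python
def mips_per_watt(mips_per_mhz: float, mw_per_mhz: float) -> float:
    """Convert a MIPS/MHz rating and a mW/MHz power figure into MIPS/W."""
    mips_per_mw = mips_per_mhz / mw_per_mhz  # the MHz terms cancel
    return mips_per_mw * 1000.0              # 1 W = 1000 mW


def scale_dynamic_power(mw_per_mhz: float, v_from: float, v_to: float) -> float:
    """Scale power per MHz with the square of the supply voltage.

    Assumption (not from the slides): switching power dominates, so P ~ V^2.
    """
    return mw_per_mhz * (v_to / v_from) ** 2


if __name__ == "__main__":
    print(f"{mips_per_watt(1.8, 0.40):.0f} MIPS/W at 1.0 V")
    print(f"{mips_per_watt(1.8, 0.57):.0f} MIPS/W at 1.2 V")
    print(f"{scale_dynamic_power(0.57, 1.2, 1.0):.2f} mW/MHz predicted at 1.0 V")
```

With the slide values, the first call gives the quoted 4500 MIPS/W, and the V^2 estimate from the 1.2 V figure lands close to the measured 0.40 mW/MHz.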

Java accelerator (BTU)
  [Block diagram: SH-X core with DSP, X-bus, Y-bus, XYRAM, cache-RAM bus, BTU, I-cache, ITLB, UTLB, D-cache, BIC, L2 cache, UBC, URAM, Inst-bus, Data-bus, and internal bus]
  - BTU: Java byte-code translation unit
  - UBC: User break controller
  - URAM: User RAM
  - BIC: Bus interface controller

BTU block diagram
  [Block diagram of the BTU: byte-code fetch, instruction buffer, decoder, ALU, translation logic, extended decoder, config register, state control, exception control, pipeline control, register file, and BTU-bus, connected to the BIC, I-cache, D-cache, internal bus, Inst-bus, and Data-bus. Signal labels: byte-code, native code (16), extended code (4, 10), immediate, state info.]

Parallel execution in BTU
  - The BTU shares control information and data with the processor core.
  - This enables parallel execution of data and control processing (e.g. Java exception detection).
  [Diagram comparing three accelerator types: coprocessor type, conventional accelerator, and BTU. Labels: register file, coprocessor, ALU, accelerator, pipeline status, exception detect. Cache labels: data separated, data shared, control shared.]

Java power evaluation
  - Performance with BTU: 6.55 ECM/MHz (basic VM: 0.64 ECM/MHz)
  - Power consumption is reduced by 6%, and power per ECM is reduced by 90%.
  [Bar chart of relative power (0.00 to 1.20) for the basic VM (no optimization) and with the BTU: power/clock frequency (6% less power) and power/ECM (90% less power per ECM)]
  - Evaluation board, 216 MHz, CLDC 1.0.4
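
The 90% figure can be cross-checked from the other numbers on the Java power evaluation slide: if power drops by 6% while ECM/MHz rises from 0.64 to 6.55, the power spent per ECM falls by roughly the speedup. The Python sketch below works through that arithmetic; the function name is illustrative, and it assumes both measurements were taken at the same clock frequency, which the slide implies but does not state outright.

```python
def relative_power_per_ecm(ecm_base: float, ecm_accel: float,
                           relative_power: float) -> float:
    """Power per ECM with the accelerator, relative to the baseline VM.

    ecm_base, ecm_accel: ECM/MHz scores without and with the accelerator.
    relative_power: accelerator power divided by baseline power at the same clock.
    """
    speedup = ecm_accel / ecm_base
    return relative_power / speedup


if __name__ == "__main__":
    ratio = relative_power_per_ecm(0.64, 6.55, relative_power=0.94)
    print(f"speedup: {6.55 / 0.64:.1f}x")
    print(f"power per ECM reduced by {100 * (1 - ratio):.0f}%")
```

The result is a reduction of about 91%, in line with the roughly 90% quoted on the slide.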

Standby power reduction
  To achieve lower standby power with minimum speed overhead:
  - Low-power SoC design platform
      - On-chip power switches (PSWs)
      - µI/O
      - Low-leakage data-retention RAM technology
  - Two standby modes
      - Resume standby mode
      - Ultra standby mode

Low-power SoC design platform (PSWs)
  - Thick-tox, high-Vth NMOS transistors are used for the on-chip power switches (PSWs).
  - This minimizes the various leakage currents: subthreshold, gate tunneling, gate-induced drain leakage (GIDL), and junction leakage.
  [Diagram: logic domain x (on/off) connected through a local vss line and a PSW, driven by the PSWC, to vss. Transistor cross-section showing the leakage currents: gate tunneling leakage (gate), GIDL (drain), subthreshold leakage (source to drain), junction leakage (body).]

Low-power SoC design platform (µI/O)
  - µI/O has a level-shift function and provides optimal supply and voltage domains for the dedicated multiple computation engines.
  - It also prevents invalid signal transmission and supports:
      - Internal vss1 and/or vss2 shutdown by on-chip power switches
      - External vdd1 and/or vdd2 shutdown by off-chip regulators
  [Diagram: domain 1 (off, vdd1) and domain 2 (on, vdd2) logic connected through µI/O; each domain has a local vss line and a power switch controlled by the PSWC, connected to vss]

Low-leakage data-retention memory
  - Hierarchical on-chip power switches in the SRAM provide subdivided power-line control.
  - In active mode
      - Vssm, Vssa, Vssc = Vss
      - Vddw = Vdd (selected), ~0.4 V down (unselected)
      - Local Vss = Vss
  - In retention mode
      - Vssa, Vssc: Hi-Z
      - Vssm: ~0.4 V up
      - Vddw: ~0.4 V down
      - Local Vss = Vss
  - In shut-down mode
      - Local Vss: Hi-Z
  [Circuit diagram: leakage control, word driver control, memory cell, and sense amplifier, with power lines Vdd, Vddw, Vss, Vssc, Vssa, Vssm, and local Vss]
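
The three memory modes amount to a small lookup table of power-line states. As a reading aid, the Python sketch below encodes only what the slide lists; the names Mode, RAIL_STATES, and floating_rails are illustrative and not from the slides, and lines the slide does not mention for a mode (for example Vssm in shut-down mode) are deliberately left out rather than guessed.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    RETENTION = "retention"
    SHUTDOWN = "shut-down"


# Power-line states per mode, transcribed from the slide.
# Lines not listed on the slide for a given mode are omitted.
RAIL_STATES = {
    Mode.ACTIVE: {
        "Vssm": "Vss",
        "Vssa": "Vss",
        "Vssc": "Vss",
        "Vddw": "Vdd (selected), ~0.4 V down (unselected)",
        "Local Vss": "Vss",
    },
    Mode.RETENTION: {
        "Vssa": "Hi-Z",
        "Vssc": "Hi-Z",
        "Vssm": "~0.4 V up",
        "Vddw": "~0.4 V down",
        "Local Vss": "Vss",
    },
    Mode.SHUTDOWN: {
        "Local Vss": "Hi-Z",
    },
}


def floating_rails(mode: Mode) -> list[str]:
    """Return the power lines that are left floating (Hi-Z) in a mode."""
    return sorted(rail for rail, state in RAIL_STATES[mode].items()
                  if state == "Hi-Z")


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, floating_rails(mode))
```

Listing the floating lines per mode shows the hierarchy: retention mode cuts only Vssa and Vssc while holding the cell supply at a reduced margin, and shut-down mode cuts the local Vss line.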

Leakage current of the memory
  [Bar chart: leakage current (µA, 0 to 1000), broken down into memory cell, word driver, and amplifier]
  - Conventional: 920 µA
  - Proposed (active): 700 µA (-25%)
  - Proposed (retention): 50 µA (-95%)
  - Conditions: 256 kB, room temperature, VDD = 1.2 V

Two low-power modes
  - Ultra standby
      - Low leakage (~10 µA)
  - Resume standby
      - Low leakage (~100 µA)
      - Quick recovery (< 3 ms)
  [Diagrams for both modes: core at 1.2 V with PSW1 (core IP and peripherals), PSW2 (URAM, backup registers), and the PSC; I/O at 2.85 V with I/O control]
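
The percentages on the memory leakage chart can be recomputed from the three currents. The Python sketch below does so; percent_reduction is an illustrative helper name, not something from the slides.

```python
def percent_reduction(baseline_ua: float, reduced_ua: float) -> float:
    """Percentage by which reduced_ua is lower than baseline_ua."""
    return 100.0 * (baseline_ua - reduced_ua) / baseline_ua


if __name__ == "__main__":
    conventional = 920.0  # µA, conventional memory
    for label, current in [("active", 700.0), ("retention", 50.0)]:
        print(f"proposed ({label}): "
              f"{percent_reduction(conventional, current):.1f}% lower")
```

The computed values, about 24% in active mode and about 95% in retention mode, are close to the rounded figures on the chart.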

R-standby recovery operation
  - Hardware operation
      - Power switch control
      - Clock generation (PLL, D.PLL lock)
      - Data backup using backup latch
          - BAR (Boot Address Register) holds the restart address
          - Clock and interrupt settings needed just after wake-up
  - Software operation
      - URAM: data backup memory
          - Control registers
          - OS task table, etc.
  [Diagram: PSW1 domain with backup latch (BAR, etc.); PSW2 domain with URAM between Vdd and Vss]

Recovery time from R-standby
  - Total recovery time from R-standby mode is only 1.6 ms without D.PLL lock, or 2.8 ms with D.PLL lock (external clock = 32 kHz).
  [Timeline chart, 0 to 3 ms, for both cases: PSW on, PLL lock-in, D.PLL lock-in, state transition, restore registers, restart tasks]

Standby power consumption
  [Bar chart: leakage current (µA); room temperature, VDD = 1.2 V]
  - Standby without power cutoff: 2.2 mA
  - R-standby: 86 µA (-96%)
  - U-standby: 11 µA (-99%)

Summary
  - 130-nm, 5-layer-Cu, dual-Vth, dual-tox CMOS technology
  - Dedicated multiple computation engines:
      - SuperH core (SH-X), including DSP and Java(TM) engines
      - MPEG-4
      - 3D graphics
  - Power efficiency
      - SH-X: 4500 MIPS/W
      - Java: 6.55 ECM/MHz
  - Low-power SoC design platform
      - On-chip power switches
      - µI/O
      - Low-leakage data-retention RAM
  - Two standby modes (R-standby and U-standby)
      - Leakage current: 86 µA and 11 µA
      - Recovery time from R-standby: 1.6 ms, or 2.8 ms with D.PLL lock (external clock = 32 kHz)
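
The standby leakage figures can be put in perspective by comparing them with the 2.2 mA drawn in standby without power cutoff. The Python sketch below computes the reduction factor and percentage for each mode; standby_saving is an illustrative name, and the inputs are the values from the standby power consumption slide.

```python
def standby_saving(baseline_ua: float, mode_ua: float) -> tuple[float, float]:
    """Return (reduction factor, percent reduction) relative to the baseline."""
    factor = baseline_ua / mode_ua
    percent = 100.0 * (1.0 - mode_ua / baseline_ua)
    return factor, percent


if __name__ == "__main__":
    baseline = 2200.0  # µA, standby without power cutoff (2.2 mA)
    for mode, current in [("R-standby", 86.0), ("U-standby", 11.0)]:
        factor, percent = standby_saving(baseline, current)
        print(f"{mode}: {factor:.0f}x lower, {percent:.1f}% reduction")
```

This gives roughly a 26x reduction for R-standby and a 200x reduction for U-standby, matching the -96% and -99% labels on the chart.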