Real-Time Dynamic Voltage Hopping on MPSoCs

Similar documents
Real-Time Dynamic Energy Management on MPSoCs

Last Time. Making correct concurrent programs. Maintaining invariants Avoiding deadlocks

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

Part IV: 3D WiNoC Architectures

Development of Low Power ISDB-T One-Segment Decoder by Mobile Multi-Media Engine SoC (S1G)

Single-Path Programming on a Chip-Multiprocessor System

Blackfin Optimizations for Performance and Power Consumption

SDR Forum Technical Conference 2007

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

Crusoe Power Management:

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

Voltage Scheduling in the lparm Microprocessor System

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

General Purpose Signal Processors

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ

An introduction to SDRAM and memory controllers. 5kk73

Applications. Department of Computer Science, University of Pittsburgh, Pittsburgh, PA fnevine, mosse, childers,

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

A 1-GHz Configurable Processor Core MeP-h1

Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor

Information Processing. Peter Marwedel Informatik 12 Univ. Dortmund Germany

Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications

Dual-Processor Design of Energy Efficient Fault-Tolerant System

Optimization for Real-Time Systems with Non-convex Power versus Speed Models

An Efficient Approach to Energy Saving in Microcontrollers

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

Configurable and Extensible Processors Change System Design. Ricardo E. Gonzalez Tensilica, Inc.

Frequency and Voltage Scaling Design. Ruixing Yang

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

MICROPROCESSOR ARCHITECTURE

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation

Low Power System Design

Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

ENERGY EFFICIENT SCHEDULING FOR REAL-TIME EMBEDDED SYSTEMS WITH PRECEDENCE AND RESOURCE CONSTRAINTS

Gigascale Integration Design Challenges & Opportunities. Shekhar Borkar Circuit Research, Intel Labs October 24, 2004

Dynamic Voltage Scaling on a Low-Power Microprocessor

PACE: Power-Aware Computing Engines

A Building Block 3D System with Inductive-Coupling Through Chip Interfaces Hiroki Matsutani Keio University, Japan

SH-X3 Flexible SuperH Multi-core for High-performance and Low-power Embedded Systems

Hardware-Accelerated Dynamic Binary Translation

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

Voltage Scheduling in the lparm Microprocessor System

Mapping C code on MPSoC for Nomadic Embedded Systems

Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy

ARM Processors for Embedded Applications

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

TIMA Lab. Research Reports

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Minimizing Energy Consumption for High-Performance Processing

SH-Mobile3: Application Processor for 3G Cellular Phones on a Low-Power SoC Design Platform

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling

Single Chip Heterogeneous Multiprocessor Design

Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA)

Low-Power Scheduling with DVFS for common RTOS on Multicore Platforms

Low-Power Technology for Image-Processing LSIs

A Non-Volatile Microcontroller with Integrated Floating-Gate Transistors

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

POWER consumption has become one of the most important

High performance, power-efficient DSPs based on the TI C64x

Network-on-Chip Architecture

Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core

Energy Aware Computing in Cooperative Wireless Networks

ENERGY EFFICIENT SCHEDULING SIMULATOR FOR DISTRIBUTED REAL-TIME SYSTEMS

Innovative Power Control for. Performance System LSIs. (Univ. of Electro-Communications) (Tokyo Univ. of Agriculture and Tech.)

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2

Multithreaded Coprocessor Interface for Dual-Core Multimedia SoC

Computer Architecture

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications

PROF. TAJANA SIMUNIC ROSING. Final Exam. Problem Max. Points Points 1 25 T/F 2 10 Esterel 3 20 Petri net 4 20 SDF 5 20 EDF/RM 6 20 LS/power Total 115

Hardware/Software Codesign of Schedulers for Real Time Systems

Dynamic Memory Management for Real-Time Multiprocessor System-on-a-Chip

Scalable series-stacked power delivery architectures for improved efficiency and reduced supply current

Research Scholar, Chandigarh Engineering College, Landran (Mohali), 2

A Practical Dynamic Frequency Scaling Solution to DPM in Embedded Linux Systems

Ready, Fire, Aim 20 years of hits and misses at Hot Chips

Challenges for Future Interconnection Networks Hot Interconnects Panel August 24, Dennis Abts Sr. Principal Engineer

POWER REDUCTION IN CONTENT ADDRESSABLE MEMORY

Implementation Synthesis of Embedded Software under Operating Systems Supporting the Hybrid Scheduling Model

CEC 450 Real-Time Systems

UML-Based Analysis of Power Consumption for Real-Time Embedded Systems

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,

Asia and South Pacific Design Automation Conference

New STM32 F7 Series. World s 1 st to market, ARM Cortex -M7 based 32-bit MCU

USB for Embedded Devices. Mohit Maheshwari Prashant Garg

NETWORKS on CHIP A NEW PARADIGM for SYSTEMS on CHIPS DESIGN

Minje Jun. Eui-Young Chung

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.

Stack Frames Placement in Scratch-Pad Memory for Energy Reduction of Multi-task Applications

Towards Optimal Custom Instruction Processors

Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM

DYNAMIC SCHEDULING OF SKIPPABLE PERIODIC TASKS WITH ENERGY EFFICIENCY IN WEAKLY HARD REAL-TIME SYSTEM

Imaging Solutions by Mercury Computer Systems

Embedded Systems: Architecture

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

Transcription:

Real-Time Dynamic Voltage Hopping on MPSoCs Tohru Ishihara System LSI Research Center, Kyushu University 2009/08/05 The 9 th International Forum on MPSoC and Multicore 1

Background Low Power / Low Energy Growth of portable electronics market Peak Performance Growth of real-time applications Scalability and adaptability Energy saving without losing the peak performance Real-time DVS on a multi-core processor 2

DVS Processor Processor can dynamically change its clock frequency Programmers can specify the clock frequency using an I/ O instruction The operating voltage is adjusted to the minimum value which guarantees correct operations of the processor Program RAM addi r1,r2,#1 switch to 100MHz Beqz r1,#-128 code decode Power status reg. data Data RAM Power Supply DC-DC converter VCO clock voltage T. Ishihara and H. Yasuura, Voltage Scheduling Problem for Dynamically Variable Voltage Processor, in Proc. of ISLPED 98, pp.197-202, Aug., 1998. 3 3

DVS Pros and Cons Pros Quadratic reduction of energy consumption Cons Low performance at low operating voltage Large overhead of DC-DC converter t TRAN E TRAN 2 C = V V IMAX = (1 η) C V DD1 DD2 V 2 2 DD1 DD2 C=100μF I MAX =1A This prevents real-time control of DVS processors Typically 0.9 T. Burd and R. Brodersen, Design Issues for Dynamic Voltage Scaling, ISLPED 2000. V DD1 =1.0 V DD2 =0.68 t TRAN =64μs, E TRAN =4.6μJ 4

Multi-Performance Processor (1/2) PEs have the same ISA Differ in their clock speeds and energy consumptions A single PE is activated at a time Dedicated Bus PE1 High V DD PE2 Middle V DD PE3 Low V DD Internal register values are transferred through a dedicated bus or a stack located in a data RAM before switching PEs Not an ILP processor nor a multi-core processor I-cache Program RAM Data RAM memory Cache ways activated can be specified by program 5

Multi-Performance Processor (2/2) Inter MPU cores: multiple MPU cores run concurrently Intra MPU core: a single PE core runs alternatively PE core PE core PE core MPU cores L1-cache No data migration is needed when a PE-core switches L2 cache No bus arbitration is needed No multi-port RAM is needed 6

Overview of Prototype Based on Media embedded Processor (MeP) originally developed by Toshiba Designed with commercial 90nm process technology 7

Synthesis Flow Characterize cells using different voltages 1.0V, 0.75V, 0.72V, 0.7V, 0.68V, 0.55V, 0.52V, 0.5V Determine clock frequency of H-PE 200MHz with 1.0V Determine bus clock frequency 66MHz which is a quotient of 200MHz Determine clock speed and V DD of L-PE 66MHz@0.52V which minimizes the energy of L-PE Determine clock speed and V DD of M-PE 133MHz@0.68V which minimizes the energy of M-PE 8

Comparison with DVS (1/2) Ours 15 GP registers are transferred through stack 16 SP registers are transferred through dedicated bus The other registers like pipeline registers are flushed Commercial DVS processor Dedicated Bus Transition time L-PE 67MHz 0.52V M-PE 133MHz 0.68V H-PE 200MHz 1.00V 1μs Energy overhead 13nJ Level conver ter Level conver ter Level conver ter Level conver ter Stack (scratchpad memory) Processor RTOS Transition time AMD Mobile K6 [1] PowerNow! 200 μsec StrongARM SA-2 [2] Embedded Linux 150 μsec ~10 μj energy overhead is involved Transmeta Crusoe LongRun > 20 μsec [3] [1]AMD, AMD PowerNow! Technology Dynamically Manages Power and Performance [2] Compaq ipaq H3600 hardware design specification ver.0.2f, http://www.handhelds.org/compaq/ipaqh3600/ipa [3]M. Fleischmann, Reducing x86 operating power through LongRun IEEE 12 th Hot Chips Symposium, August 15, 2000 9

Comparison with DVS (2/2) Extract a critical path from our 1.0V design Measure the delay of the path using HSPICE for different operating voltages Performance of low voltage operation is not very good DVS Processor 35% energy reduction!! Our Processor 31% energy reduction!! 10

MPP Pros and Cons Pros Low transition overhead Two orders of magnitude less than that of DVS Higher performance at low operating voltage Each PE core is optimized for the target supply voltage Cons Large area overhead Limited number of clock speeds Need multiple voltage sources Memory H-PE M-PE DC-DC DC-DC 11

Comparison with CMP Heterogeneous chip multi-processor (CMP) Processor1 runs faster and consumes higher energy than Processor2 Task migration can be a better solution if it reduces the energy w/o violating a real-time constraint Processor1 CPU-core1 cache SPM Overheads 30μs 3μJ Task migration Processor2 CPU-core2 cache SPM 12

Power Results of MPEG2 encode Original 1.0V/200MHz 0.68V/133MHz 0.52V/67MHz Stack is placed in the data scratchpad memory HW configuration is fixed before entering the main routine 13

Energy and Execution Time 2x energy scalability 14

Application of DVS Processor Motivation Most tasks complete much earlier than their WCET Approach (for periodical tasks only) Divide a 1-frame video encoding task into 33 slices Exploit slack time of previous slices in the current slice Application 66.67ms Frame1 Frame2 Frame15 Slice 1 Slice 2 Slice 3 2.02ms WCET TS1 TS2 TS3 TS33 0.5ms for transition 1 2 3 33 slices S. Lee and T. Sakurai, Run-time voltage hopping for low-power real-time systems, in Proc. Asia South Pacific Design Automation Conference., Jan. 2000, pp. 381-386. 15

Experiments Overheads 26μs 164nJ DC-DC & VCO H-PE 35mW@200MHz DVS VDD F DVS Processor 34mW@200MHz 18mW@133MHz T. Burd and R. Brodersen, Design Issues for Dynamic Voltage Scaling, ISLPED 2000. MPP (ours) Overheads 1μs,13nJ Shared local-memory M-PE 14mW@133MHz Normalized Energy Orig. DVSMPP [ms] Average execution time for each slice 16

Layout Image of MPU-core1 Designed with commercial 90nm CMOS technology D-SPM 0.62mm 2 15.0% I-SPM 0.40mm 2 9.5% I-Cache 0.74mm 2 17.5% High-end PE 0.19mm 2 4.5% Middle-end PE 0.21mm 2 5.0% Others 2.07mm 2 48.5% Total 4.23mm 2 100% 17

Summary Small overhead compared with DVS Transition time : 100x smaller Transition Energy : 1000x smaller More energy efficient at low voltage Clock frequency : 30% more efficient Large area overhead ~25% area overhead 18

Thank you!! 19

Real-Time Voltage Hopping (1/2) ACET: Average Case Execution Time WCET: Worst Case Execution Time Period Deadline Priority ACET1 (Energy) WCET1 ACET2 (Energy) WCET2 T1 50us 50us 1 5us (200nJ) 10us 10us (100nJ) 20us T2 80us 80us 2 10 us (400nJ) 20us 20 us (200nJ) 40us T3 100us 100us 3 15us (600nJ) 40us 30 us (300nJ) 80us Assumption : switches High speed Low speed CPU speed can be switched upon task Strategy : CPU speed is lowered only if any succeeding task does not violate its deadline 0 50 100 150 200 250 300 350 400 20

Real-Time Voltage Hopping (2/2) Conventional DVS consumes 6.46mJ (6.0mJ) 0 50 100 150 200 250 300 350 400 T1, T2, and T3 run twice, once and once in every 100us WCET is 80us Our MPP consumes 4.62mJ (23% reduction) 0 50 100 150 200 250 300 350 400 21