Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays


Éricles Sousa (1), Frank Hannig (1), Jürgen Teich (1), Qingqing Chen (2), and Ulf Schlichtmann (2)
(1) Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany
(2) Institute for Electronic Design Automation, Technische Universität München (TUM), Germany
Workshop on Software and Compilers for Embedded Systems (SCOPES), St. Goar, Germany, June 1-3, 2015.

Motivation
Here, we consider a resource-aware computing paradigm to exploit runtime adaptation without violating any thermal and/or power constraints in a programmable MPPA.
[Figure sources: Tilera Inc.; KIT's humanoid ARMAR; http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html]

Invasive Computing
A resource-aware computing paradigm. Each application may use the available computing resources in three phases:
- invade: exploring and claiming resources
- infect: copying program code and data to the claimed resources and configuring them for parallel execution
- retreat: releasing the occupied resources
Support for resource awareness at various levels:
- Application level
- Runtime-system level
- Architecture level
The architecture consists of different compute tiles:
- General-purpose tiles
- General-purpose tiles with reconfigurable fabrics
- Programmable accelerators (MPPA)
[Figure: tiled architecture in which every tile is attached to a NoC router; the tiles shown are memory, I/O, i-Core, and TCPA (MPPA) tiles.]

Outline
- Motivation
- Tightly Coupled Processor Arrays (TCPAs)
- Proposed System
- Experimental Results
- Conclusion

Tightly Coupled Processor Arrays (TCPAs)
- Class of massively parallel on-chip processor architectures
- Highly customizable at synthesis time: many parameters and configuration options, e.g., number of PEs, data path width
- Flexibility at runtime: programmable, reconfigurable interconnect
- Used as accelerators in MPSoCs or tiled architectures for computationally intensive tasks from domains such as:
  - Digital signal processing, audio and video
  - Linear algebra (matrix/vector computations)
  - Cryptography

TCPAs (cont'd)
Processing elements (PEs):
- VLIW architecture
- Weakly programmable
- Small instruction set and memory
- Small register file
- No direct main memory access
Memory structures:
- Reconfigurable buffers at the borders of the array
- FIFOs at the PE inputs, and internal general-purpose and feedback shift registers for cyclic data reuse
[Figure: PE block diagram showing the instruction memory, instruction decoder, PC and branch unit, flags register, input ports (ID0) with multiplexers, general-purpose and feedback registers (RD0 to RD4, FD0 to FD4) with read and write ports, ALUs, and output ports (OD0, OD1) with a demultiplexer.]

TCPAs (cont'd)
Interconnect:
- Switched connections
- Single-cycle latency
- Switching possibilities can be defined at synthesis time, which enables the configuration of different interconnect topologies at runtime
- Example topologies shown: systolic array, 2D torus, 4D hypercube
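To make the topology examples more concrete, the following minimal sketch computes which PEs are neighbors in a 2D torus and in a 4D hypercube. Both fit a 16-PE array (4 x 4 and 2^4). The function names and the PE numbering are assumptions made for illustration; they are not part of the TCPA tooling.

```python
def torus_neighbors(x, y, cols, rows):
    """Neighbors of PE (x, y) in a 2D torus: four mesh neighbors with wrap-around."""
    return {
        ((x - 1) % cols, y),
        ((x + 1) % cols, y),
        (x, (y - 1) % rows),
        (x, (y + 1) % rows),
    }


def hypercube_neighbors(pe_id, dims=4):
    """Neighbors of a PE in a hypercube: IDs differing from pe_id in exactly one bit."""
    return {pe_id ^ (1 << d) for d in range(dims)}


# Hypothetical 16-PE example: corner PE of a 4x4 torus and PE 0 of a 4D hypercube.
print(sorted(torus_neighbors(0, 0, cols=4, rows=4)))
print(sorted(hypercube_neighbors(0)))
```

In both topologies every PE has exactly four neighbors; only the wiring pattern differs, which is why a switched interconnect can realize either one on the same array.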

Designing a Customized TCPA
[Figure: TCPA Editor, TCPA Params, TCPA Configurations]

Challenges
- How to estimate temperature without thermal sensors?
- How to estimate power consumption in large processor arrays with potentially hundreds of PEs?
- How to satisfy application requirements considering thermal and power constraints?
Proposed Solution
- A runtime adaptation system that exploits quality/throughput tradeoffs while satisfying thermal and power constraints
- Our solution uses the concepts of invasive computing. Thus, it is possible to achieve
  - constant throughput by reducing the quality of image processing (e.g., accuracy of edge detection), or
  - constant quality by reducing the throughput


Thermal Estimation
- Without thermal sensors, we estimate the initial temperature inside the chip by generating the floorplan of the TCPA using Cadence Encounter and a 45 nm CMOS technology
- The floorplan is the input for HotSpot; as output, we obtain the thermal conductance values between PEs and the average temperature value (T_initial) of the entire processor array
[Figure: two 4x4 grids of PEs, PE(0,0) to PE(3,3), with a color scale from low to high. Captions: "ASIC implementation of a 4x4 TCPA using the NanGate 45 nm Open Cell Library" and "Estimated thermal dissipation when only 4 PEs (on top of TCPA) are executing an application".]

Power Estimation
- Instead of gate-level simulation, we propose macro-modeling for power estimation
- We synthesized different designs (e.g., varying the number of functional units (FUs) in each PE, the data path width, etc.)
- Then, we annotate the power consumption of each functional unit in a table
- Because each processing element of a TCPA may have a different configuration, our model takes into account the individual PEs of a two-dimensional array of size X × Y
- The number of FUs belonging to a PE is denoted by N_FUs
- The values of S are the average switching activities of the FUs, which are obtained from a cycle-accurate simulation and vary according to each application

\[ P_{\mathrm{avg}} = \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \sum_{n=0}^{N_{\mathrm{FUs}}-1} P_{\mathrm{FU}_n}\left(S_{x,y,n}\right) \]
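The triple sum can be sketched in a few lines. This is only an illustration under stated assumptions: the per-FU power models are represented as plain callables standing in for the annotated table, and the linear models and all numbers in the example are hypothetical, not values from the slides.

```python
def average_power(switching, fu_power_models):
    """Sum P_FUn(S[x][y][n]) over all PEs (x, y) and all of their FUs n.

    switching[x][y]       list of average switching activities, one per FU
    fu_power_models[x][y] list of callables, one per FU, mapping a switching
                          activity to power in mW (stand-in for the table)
    """
    total = 0.0
    for x, column in enumerate(switching):
        for y, activities in enumerate(column):
            for n, s in enumerate(activities):
                total += fu_power_models[x][y][n](s)
    return total


# Hypothetical 2x2 array; every PE has an adder and a multiplier.
adder = lambda s: 0.2 * s        # assumed linear model, mW
multiplier = lambda s: 0.4 * s   # assumed linear model, mW
models = [[[adder, multiplier] for _ in range(2)] for _ in range(2)]
activity = [[[0.5, 0.5] for _ in range(2)] for _ in range(2)]
print(average_power(activity, models))
```

Because the models are stored per PE, PEs with different numbers or kinds of FUs are handled without any change to the loop.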

Configuration Manager
- The configuration manager holds configuration streams containing the assembly code of the PEs as well as the interconnect topology
- The configuration manager is mainly composed of three parts:
  - Hardware/software interface
  - Configuration loader
  - Configuration memory

Runtime Adaptation
General assumptions:
- The power budget can be updated dynamically at runtime, for example, by a management scheme for heterogeneous dark silicon processors
- The runtime thermal variation is calculated according to a well-known thermal equation:

\[ T = \mathit{threshold} + \left(T_{\mathrm{initial}} - \mathit{threshold}\right) e^{-g t} \]

where
- T: current temperature (°C) inside the accelerator
- threshold: minimum (T_min) or maximum (T_max) temperature bound for an application
- T_initial: initial temperature
- g: thermal conductance between PEs
- t: time
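A minimal sketch of this thermal model, assuming the exponential form T = threshold + (T_initial - threshold) * exp(-g * t). The function names and the numeric values in the example (including g = 0.5) are hypothetical and chosen only to show the behavior.

```python
import math


def temperature(t, t_initial, threshold, g):
    """Temperature after time t, moving from t_initial toward the threshold."""
    return threshold + (t_initial - threshold) * math.exp(-g * t)


def time_to_reach(t_target, t_initial, threshold, g):
    """Invert the model for t; the target must lie between t_initial and the threshold."""
    ratio = (t_target - threshold) / (t_initial - threshold)
    if not 0.0 < ratio <= 1.0:
        raise ValueError("target is not reached on the way to the threshold")
    return -math.log(ratio) / g


# Hypothetical heating phase: start at 50 degrees C, bound T_max = 70 degrees C.
for t in (0.0, 1.0, 5.0):
    print(t, temperature(t, t_initial=50.0, threshold=70.0, g=0.5))
print(time_to_reach(60.0, t_initial=50.0, threshold=70.0, g=0.5))
```

In this form the threshold acts as the value the temperature approaches over time, so the same function describes heating toward T_max and cooling toward T_min.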

Runtime Adaptation
[Figure, flowchart of the runtime adaptation software:]
- Start: check (T < T_max) and (P_avg < P_budget)
- Invade all PEs, then infect all PEs
- Check (T ≥ T_max) or (P_avg ≥ P_budget); if yes, invade fewer PEs, then infect fewer PEs and retreat unused PEs
- Check (T ≤ T_min) and (P_avg ≤ P_budget); if yes, use all PEs again
[Figure, plots: temperature and average power over time for "No runtime adaptation" and "Our runtime adaptation", with T_max, T_min, P_budget, and the steady state marked.]
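The decision logic of the flowchart can be sketched as a two-state controller. This is a simplified reading, not the authors' implementation: it assumes the scale-down condition is T ≥ T_max or P_avg ≥ P_budget and the scale-up condition is T ≤ T_min and P_avg ≤ P_budget. The mode names, the function name, and the power budget of 12 mW in the example are hypothetical.

```python
def next_mode(mode, temp, p_avg, t_min, t_max, p_budget):
    """Return "all" (use all PEs) or "fewer" (use fewer PEs) for the next step."""
    if mode == "all" and (temp >= t_max or p_avg >= p_budget):
        return "fewer"  # invade/infect fewer PEs and retreat the unused ones
    if mode == "fewer" and (temp <= t_min and p_avg <= p_budget):
        return "all"    # invade and infect all PEs again
    return mode


# Hypothetical trace of (temperature in degrees C, average power in mW).
trace = [(55.0, 10.35), (70.0, 10.35), (60.0, 1.24), (50.0, 1.24)]
mode = "all"
history = []
for temp, p_avg in trace:
    mode = next_mode(mode, temp, p_avg, t_min=50.0, t_max=70.0, p_budget=12.0)
    history.append(mode)
print(history)
```

Using two different bounds (T_max to scale down, T_min to scale back up) gives the controller hysteresis, so it does not toggle between configurations on every small temperature change.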


Experimental Setup
FPGA design:
- Xilinx Virtex-5 FPGA
- The architecture consists of a RISC processor and a TCPA, both connected through the ARM Advanced Microcontroller Bus Architecture (AMBA)
- The TCPA is composed of a 5 × 5 array of VLIW processing elements running at 60 MHz
ASIC design:
- Temperature and power consumption are estimated using the NanGate 45 nm technology
- Frequency: 550 MHz
Target application:
- Edge detection
- Input frame size: VGA (640 × 480)

Kernel    Window size    Number of PEs
Laplace   5 × 5          25
Sobel     1 × 3          3

Experimental Setup (FPGA design)
- Video stream input/output through a DVI extension board
[Figure: block diagram of the FPGA design with four reconfigurable buffers, the AHB bus, the AHB/APB bridge, the APB bus, Int. Ctrl., and Conf. & Com. Proc.]

Experimental Results (FPGA Design)
[Figure: execution time per frame (ms) and level of quality versus the number of PEs for the 5 × 5 Laplace and 1 × 3 Sobel kernels. Annotation: faster computation scenario providing 2 levels of quality and the same throughput.]

Experimental Results (ASIC Design)
- Temperature variation inside the TCPA according to the number of PEs allocated for the computation of the 5 × 5 Laplace and 1 × 3 Sobel kernels
- Temperature thresholds: T_max = 70 °C, T_min = 50 °C
[Figure: temperature (°C) over time for "No Runtime Adaptation" and "Our Runtime Adaptation"; the adapted run alternates between 25 PEs (5x5) and 3 PEs (1x3). A color scale ranges from low to high temperature.]

Experimental Results (ASIC Design)
- Average power consumption of the 5 × 5 Laplace and 1 × 3 Sobel kernels using 25 and 3 PEs, respectively:
  - P_avg (5 × 5 Laplace) = 10.35 mW
  - P_avg (1 × 3 Sobel) = 1.24 mW
- Reconfiguration overhead: 1.64 μs
[Figure: online power consumption and its average (mW) over time while alternating between 25 PEs (5x5) and 3 PEs (1x3).]


Conclusion
- We presented a strategy for runtime adaptation of application execution under thermal and power constraints in massively parallel processor arrays (MPPAs)
- Unlike [1], we exploit quality/throughput tradeoffs in image processing, keeping (a) constant throughput by reducing the quality, or (b) constant quality by reducing the throughput
- Running at 550 MHz, the reconfiguration overhead (1.64 μs) is more than 300 times shorter than the execution time per frame (0.56 ms)
- The maximum power consumption of 25 PEs is 10.35 mW

[1] C. Tan, T. S. Muthukaruppan, T. Mitra, and L. Ju. Approximation-aware scheduling on heterogeneous multicore architectures. In: 20th Asia and South Pacific Design Automation Conference (ASP-DAC). 2015, pp. 618-623.
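As a quick check of the "more than 300 times" figure, the ratio and the corresponding cycle counts at 550 MHz can be worked out from the numbers stated in the slides (the cycle counts are derived here and are not stated in the source):

```latex
\[
\frac{t_{\mathrm{frame}}}{t_{\mathrm{reconf}}}
  = \frac{0.56\ \mathrm{ms}}{1.64\ \mu\mathrm{s}}
  = \frac{560\ \mu\mathrm{s}}{1.64\ \mu\mathrm{s}}
  \approx 341
\]
\[
t_{\mathrm{frame}} \cdot f = 0.56\ \mathrm{ms} \times 550\ \mathrm{MHz} = 308\,000\ \text{cycles},
\qquad
t_{\mathrm{reconf}} \cdot f = 1.64\ \mu\mathrm{s} \times 550\ \mathrm{MHz} = 902\ \text{cycles}
\]
```

For comparison, a VGA frame has 640 × 480 = 307,200 pixels, which is close to the per-frame cycle count.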

Further Information
Contact:
Éricles Sousa
Hardware/Software Co-Design, Department of CS, University of Erlangen-Nuremberg
Cauerstr. 11, 91058 Erlangen, Germany
Email: sousa@cs.fau.de
Phone: +49 9131 85-67312
Acknowledgements
This work is supported by
- the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR 89)
- the Research Training Group 1773 "Heterogeneous Image Systems"
- the Brazilian National Council for Scientific and Technological Development (CNPq)
www.invasive-computing.org