On-chip Networks Enable the Dark Silicon Advantage. Drew Wingard CTO & Co-founder Sonics, Inc.

Transcription:

On-chip Networks Enable the Dark Silicon Advantage Drew Wingard CTO & Co-founder Sonics, Inc.

Agenda
- Sonics history and corporate summary
- Power challenges in advanced SoCs
- General power management techniques
- On-chip network features and benefits
- Optimizing dark silicon with on-chip networks
- Future work

Sonics: Leader in System IP for SoCs
- Sonics enables designers to integrate any IP from anywhere, anytime
  - Easy IP re-use
  - Connecting third-party IP / subsystems
- Total system approach
  - Intelligent memory scheduling
  - Optimal power-aware designs
  - Data flow services: QoS, security firewalls
- World-class engineering team
  - Largest team of on-chip network engineers
  - Strong local presence in Japan
- Commanding presence in digital entertainment, mobile and wireless
  - 8 of top 10 semi SoC companies
- Results: 2 billion units shipped, over 200 design completions

ARM and Sonics
- ARM and Sonics have been working together to mutually support SoC customers for more than 10 years
  - Multiple generations of ARM's flagship CPUs for application processors
  - Multiple generations of AMBA
- Sonics fully supports ARM SoC initiatives: AMBA, TrustZone, etc.
- Recently announced expanded partnership focused on enhanced interoperability and power management
  - Plus a patent licensing arrangement

How is Your Current SoC Project Going?
- Are you hitting your performance targets?
- Did you achieve the frequency you hoped for?
- Are you staying within your power budgets?
- Did you see your throughput decrease as frequency increased?
- Did timing issues at layout force you to re-work your architecture?

Common Architecture for Over 16 Years
A common on-chip network architecture
- Structure: IP core sockets, isolated from network fabric by intelligent agents
- Sockets: AMBA ACE, AXI 3/4, AHB, APB, OCP 1/2/3
- Protocols: completely non-blocking multi-threaded fabrics
- Features: end-to-end QoS, security, error and power management, etc.
- Software: consistent register-level views
- Development environment: unified SonicsStudio tools
enables a family of micro-architectures
- SonicsGN: highly scalable multi-domain router-based fabric at up to 2 GHz
- SonicsSX: low-latency cascaded crossbar fabric
- Sonics3220: efficient sharing of many peripherals spread across the SoC
and supporting System IP
- MemMax scheduler: delivering highest DRAM throughput and QoS

Example: Tablet Application Processor
[Figure: block diagram of a tablet application processor. Blocks shown: Cortex-A15 x 4 and Cortex-A7 x 4 CPU clusters (each with L2 cache), Mali-T658 quad-core GPU, CoreLink CCI-400 coherency fabric, SonicsGN on-chip network, two Sonics MemMax memory schedulers with DRAM controllers, Sonics3220 peripheral network, and ROM, Secure ROM, Security, SRAM, DMA, LCD controller, HDMI, Video Codec, Cam 1, Cam 2, Ethernet, PCIe, Audio, SATA, USB and APB peripherals. Power domains are marked; clock labels range from 133 MHz to 1333 MHz.]

Market Survey: Increasing SoC Complexity
- Design complexity increasing
  - Power/performance/area remain key challenge
- Complexity driven
  - Frequency: broad range of implementation points; 51% need > 1 GHz
  - Multiple power domains: better battery life, coping with Dark Silicon
  - Domains often tied to key subsystems
Source: Sonics conducted survey during October 2012, with 318 responses

Power Consumption is a Major Concern
- Battery-powered devices
  - Battery life is a key selling feature
  - Battery size impacts weight, pocket-ability, hand-fit, etc.
- Line-powered devices need to be concerned with power, too
  - Power consumption impacts cost of packaging
  - Power supply may be limited (e.g. PoE, Energy Star, EU Energy Label)
  - Cooling issues
- No new SoC development can afford to ignore power consumption

The Dark Silicon Challenge
- Moore's Law enables integration of massive functionality on an SoC
  - More than 1 billion transistors at 28 nm
- But leakage current limits how many transistors can be powered
  - Multiple threshold voltages and dynamic voltage control help
- The result: Dark Silicon, the imperative to dynamically manage which parts of the SoC are powered
- Many people believe that Dark Silicon is a problem
- Sonics believes that it is an opportunity to re-think how we partition SoCs to better exploit performance while minimizing power/energy

Power Management Techniques
- General techniques (listed against a "difficulty" axis on the slide)
  - Clock gating
  - Stop/start subsystem clocks
  - Dynamic clock frequency
  - On/off voltage domains
  - Dynamic voltage/frequency domains (DVFS)
- IP-specific techniques
  - ARM big.LITTLE (use optimum IP for loading)
- Power managers implement the techniques
  - Software: flexible, but slow
  - Hardware: very responsive, but less flexible

Measured Benefits of SGN Clock Gating vs. Conventional
[Figure: relative power (0 to 1) plotted against relative throughput (0% to 100%) for three cases: Synthesis Gating Only; Synthesis Gating + Sonics Idle Detection; Sonics-provided Fine Gating and Idle Detection. Annotation: automatic idle detection.]

Reducing Clock Power
- Reduce the clock frequency when possible
- Stop the clock when there is nothing useful to be done
- To get the best result, this needs to be architected into the IP
- Prefer a hierarchical approach
  - Fine-grain clock gating: at a register or state machine level, when there is nothing useful to do, stop the clock
    - Toggling just 1 clock gate instead of n loads, where n = number of local flops
  - Coarse-grain clock gating: at a component level, when all internal clock gates block the clock, gate the clock to the component
    - Toggling just 1 load instead of m loads, where m = number of fine-grain clock gates
- This approach allows extremely effective clock gating
  - Typical Sonics designs achieve > 99.5% clock gating, many > 99.9%
  - For example: 16 free-running flops in a network with > 40K flops (99.96%)
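
The clock-gating figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, 40,000 is taken as the lower bound of the slide's "> 40K flops", and the 100 gates x 32 flops component is a hypothetical size, not a Sonics design.

```python
def gated_fraction(total_flops: int, free_running_flops: int) -> float:
    """Fraction of flops whose clock can be stopped by a clock gate."""
    if total_flops <= 0 or not 0 <= free_running_flops <= total_flops:
        raise ValueError("need total_flops > 0 and 0 <= free_running_flops <= total_flops")
    return 1.0 - free_running_flops / total_flops


def idle_clock_loads(m_fine_gates: int, n_flops_per_gate: int, scheme: str) -> int:
    """Clock loads that still toggle while the whole component is idle."""
    if scheme == "none":    # no gating: every flop sees the clock
        return m_fine_gates * n_flops_per_gate
    if scheme == "fine":    # each fine-grain gate toggles, its n flops do not
        return m_fine_gates
    if scheme == "coarse":  # one coarse-grain gate sits in front of all m fine-grain gates
        return 1
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Slide example: 16 free-running flops out of (at least) 40,000.
    print(f"gated: {gated_fraction(40_000, 16):.2%}")
    # Hypothetical component: m = 100 fine-grain gates, n = 32 flops per gate.
    for scheme in ("none", "fine", "coarse"):
        print(scheme, idle_clock_loads(100, 32, scheme))
```

With more than 40,000 flops the gated fraction only rises, so the slide's 99.96% is consistent with the lower bound used here.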

Reducing Voltage-related Power
- Reduce or remove the voltage when possible
- Partition the design into multiple power domains
- Reduced voltage can save significant dynamic power: P = C × V² × f
- Switching off the voltage saves even more: leakage = 0
- Especially effective when large parts of the SoC can be switched off
- ARM big.LITTLE is a good example!
[Figure: SoC floorplan partitioned into voltage domains (labels V1 to V5 in the first build of the slide, V1 to V3 later), with some domains marked OFF; the final build labels A15 and A7 blocks with OFF.]
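
To make the quadratic voltage term in P = C × V² × f concrete, here is a minimal sketch. The capacitance, voltage and frequency values are hypothetical, chosen only to show the scaling; they do not come from the presentation.

```python
def dynamic_power(c_farads: float, v_volts: float, f_hertz: float) -> float:
    """Dynamic (switching) power, P = C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hertz


if __name__ == "__main__":
    # Hypothetical operating points: same switched capacitance and frequency,
    # supply voltage lowered from 1.0 V to 0.8 V.
    nominal = dynamic_power(1e-9, 1.0, 1e9)
    reduced = dynamic_power(1e-9, 0.8, 1e9)
    print(f"relative dynamic power at 0.8 V: {reduced / nominal:.2f}")
    # Lowering frequency as well (DVFS style) compounds the saving.
    dvfs = dynamic_power(1e-9, 0.8, 0.5e9)
    print(f"relative dynamic power at 0.8 V and half frequency: {dvfs / nominal:.2f}")
```

Because voltage enters squared, a 20% voltage reduction alone removes about 36% of the dynamic power in this model, while a fully switched-off domain also eliminates its leakage.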

Challenge: Enabling Power Domains for the SoC
- With standard fabrics, the natural choice is to create boundaries at the bus interface
- The bus must be powered if any of the attached cores are powered
  - Forces bus into an always-on portion of the SoC, or
  - Requires partitioning fabric at power domain boundaries, complicating design
- Requires some kind of domain crossing at the bus interface
  - Which may have MANY wires
[Figure: initiators (I) and targets (T) attached to a shared bus.]

On-chip Network Features and Benefits
- Efficient IP integration
  - Universal connectivity: AMBA (AXI 3/4, ACE, AHB, APB), OCP, PIF and proprietary cores
  - Serialized router-based network: reduced wire count, up to 1/16 (figure labels: HDMI, 64-Pins, 16-Pins)
- High performance
  - High speed: 2 GHz fabric speed
- Highest bandwidth
  - Virtual channels for efficient link sharing: shared link, fewer wires, up to 16 channels
- Security
  - Firewalls: flexible security domains, TrustZone capable
  - Firewall at any target
[Figure: tablet SoC block diagram. Cortex-A15, Cortex-A7, CoreLink CCI-400, Mali GPU, LCD, HDMI, Video, Video Encode, Cam and Audio sit above the SonicsGN on-chip network (OCP sockets labeled); DRAM, DRAM, SRAM, ROM, PCIe, Enet, SATA and USB sit below it. Clock labels: 533 MHz and 1333 MHz.]

Challenge: Enabling Power Domains for the SoC
- With standard fabrics, the natural choice is to create boundaries at the bus interface
- The bus/crossbar must be powered if any of the attached cores are on
  - Forces fabric into an always-on portion of the SoC, or
  - Requires partitioning fabric at power domain boundaries, complicating design
- Requires some kind of domain crossing at the bus interface
  - Which may have MANY wires
[Figure: initiators (I) and targets (T) attached to a shared bus/crossbar.]

Using the Network to Enable Power Domains
- Could use a bus-style approach
  - Place power boundaries at IP sockets
  - This approach leaves power on the table
- No need to power the agent (network interface) when the IP core is off
  - Agent and core: always on or off together?
- Network components can be partitioned inside power domains!
[Figure: initiators (I) and targets (T) on the network, with power domain boundaries drawn through the network rather than at the sockets.]

Safe Operation with Powered Down Domains
- Initiator agent clears path to target to enable safe shutdown of power domain
- Initiator agent returns errors on access to powered-off domains
- Initiator agent knows power state of each domain along its routing paths
[Figure: initiator agent highlighted among initiators (I) and targets (T).]
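
The three behaviors listed on this slide can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not the SonicsGN implementation: the class, method and domain names are hypothetical, and "clearing the path" is approximated as waiting until no transactions are outstanding on any route through the domain.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class PowerState(Enum):
    ON = "on"
    OFF = "off"


class Response(Enum):
    OK = "ok"
    ERROR = "error"


@dataclass
class InitiatorAgent:
    # Domains crossed on the route to each target (hypothetical names).
    route_domains: Dict[str, List[str]]
    # The agent's view of each domain's power state.
    domain_state: Dict[str, PowerState]
    # In-flight transactions per target.
    outstanding: Dict[str, int] = field(default_factory=dict)

    def request(self, target: str) -> Response:
        """Issue a request, or return an error if the route crosses an OFF domain."""
        if any(self.domain_state[d] is PowerState.OFF for d in self.route_domains[target]):
            return Response.ERROR
        self.outstanding[target] = self.outstanding.get(target, 0) + 1
        return Response.OK

    def complete(self, target: str) -> None:
        """Record that one in-flight transaction to the target has finished."""
        self.outstanding[target] -= 1

    def path_clear(self, domain: str) -> bool:
        """True when no transaction is in flight on any route through the domain."""
        return all(self.outstanding.get(t, 0) == 0
                   for t, doms in self.route_domains.items() if domain in doms)

    def power_down(self, domain: str) -> bool:
        """Acknowledge shutdown only once the path through the domain is clear."""
        if not self.path_clear(domain):
            return False
        self.domain_state[domain] = PowerState.OFF
        return True
```

In this model a shutdown request is refused while traffic is still in flight, and later accesses to the dark domain fail fast instead of hanging the initiator.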

Network can Automatically Wake-up Components
- Initiator agent knows which components need to wake up
  1. Hold traffic
  2. Send a request to the system power manager
  3. Receive response
  4. Release traffic
[Figure: initiators (I) and targets (T) on the network, with a Power Manager block.]
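
The four-step wake-up sequence above can be sketched as a small synchronous model. Everything here is an assumption made for illustration: the `PowerManager` and `AutoWakeAgent` classes, their methods, and the target and domain names are hypothetical, and a real power manager would respond only after the supply and clocks have settled.

```python
from collections import deque


class PowerManager:
    """Hypothetical system power manager that powers a domain on when asked."""

    def __init__(self):
        self.powered = set()

    def wake(self, domain: str) -> bool:
        self.powered.add(domain)
        return True  # response: the domain is now up


class AutoWakeAgent:
    """Initiator agent that wakes a sleeping target domain before sending to it."""

    def __init__(self, power_manager: PowerManager, target_domain: dict):
        self.pm = power_manager
        self.target_domain = target_domain  # target name -> domain name
        self.held = deque()                 # traffic waiting for a wake-up
        self.delivered = []                 # traffic released to the network
        self.log = []                       # sequence of wake-up steps taken

    def send(self, target: str, payload: str) -> None:
        domain = self.target_domain[target]
        if domain in self.pm.powered:
            self.delivered.append((target, payload))
            return
        self.held.append((target, payload))  # 1. hold traffic
        self.log.append("hold")
        self.log.append("request")           # 2. send a request to the power manager
        awake = self.pm.wake(domain)
        self.log.append("response")          # 3. receive response
        if awake:
            self.log.append("release")       # 4. release traffic
            self._release()

    def _release(self) -> None:
        """Deliver held traffic whose target domain is now powered."""
        still_held = deque()
        while self.held:
            target, payload = self.held.popleft()
            if self.target_domain[target] in self.pm.powered:
                self.delivered.append((target, payload))
            else:
                still_held.append((target, payload))
        self.held = still_held
```

Once the domain is up, later traffic to the same target passes straight through without another round trip to the power manager.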

Tablet SoC Design Example
Power-aware on-chip network:
- Domain partitioning
- Clock gating
- Domain on/off control
[Figure: tablet SoC partitioned into power domains. Labels: Domain 1 to Domain 7, Subdom 1 to Subdom 3. Blocks: Cortex-A15, Cortex-A7, Mali GPU, LCD, HDMI, Video, Video Encode, Cam, Audio, CoreLink CCI-400, SonicsGN on-chip network, two DRAM controllers, SRAM, ROM, PCIe, Enet, SATA, USB, temperature sensor and PMIC interface.]

Network Power Management
- Unlimited number of domains: power, voltage, frequency
- Domains can cross anywhere in the network
- Synchronous, asynchronous, mesochronous crossing
- Power bundle at all domains
- Fast wake and shutdown
- Auto wake
[Figure: the same tablet SoC domain diagram (Domain 1 to Domain 7, Subdom 1 to Subdom 3), with a Domain Power Manager attached through power bundle signals labeled Power Down Req, Power Down Ack, Auto Wake Enable and Auto Wake Reg.]

Concept: Power Manager IP to Leverage Network
Highly integrated, power-aware on-chip network + on-chip power manager + integrated tool chain = automated, fine-grained, highly responsive Dark Silicon solutions
[Figure: tablet SoC domain diagram (Domain 1 to Domain 7, Subdom 1 to Subdom 3) with CPU1, CPU2, GPU, LCD, HDMI, Video, Video Encode, Cam, Audio, coherency fabric, SonicsGN on-chip network, two DRAM controllers, SRAM, ROM, PCIe, Enet, SATA and USB, plus a Future Power Manager block containing a temperature sensor, PMIC interface and microcontroller.]
March 2013. 2013, Sonics, Inc. Proprietary, NDA Required

Integrated Power Management Benefits
- Complete power management solution
  - Advanced on-chip network
  - System power manager: hardware and software
  - Advanced tooling environment
- Enables much finer grained power control
  - Fast & safe transition to lower power states
  - Power on just in time (auto wake-up)
- Much less CPU overhead
  - Keep CPU powered off more
  - Avoid lots of context switches
[Figure: two power-versus-time charts. Conventional: the CPU is woken up to switch power state. Future Power Manager: hardware-controlled switching, with larger power savings and earlier completion.]

Future Power Management Benefits
Sonics: uniquely positioned to provide advanced SoC power management
Capability:
- On-chip network that spans arbitrary collections of power domains
- Power/voltage/clock domain aware on-chip network with power management interface
- Auto-wake algorithm
- Integrate network capture and performance analysis tools
- Automated support for domain partitioning
- Automated correct-by-construction approach
Benefit:
- Easily implement many domains
- Supports late/iterative partitioning choices
- Safe and fast hardware-controlled shutdown
- Auto-wakeup signals to power manager
- Ensures minimum ON time
- Minimize leakage and idle power
- Reduced time and effort
- Reduced time and effort
- Supports many more domains without TTM and verification risks
Can save HALF of total SoC power consumption!

THANK YOU

Managing Power with SonicsGN
- Flexible power domain support
  - Asynch/mesochronous
  - Isolation/level shifters
- HW-controlled safe shutdown
- Automatic wakeup
- Benefits:
  - More domains
  - Quicker shutdown
  - Faster wakeup
  - Keep more dark, more of the time
- 50% SoC power reduction!
[Figure: SonicsGN request network for the tablet SoC with power domain boundaries marked. Initiator-side blocks: Cortex-A15, Cortex-A7 and Mali-T658 clusters behind CCI-400, display controller, HDMI, video engine, video encode, Cam 1, Cam 2, audio, USB 1 to 3, USB OTG, PCIe, E-net, security engine, DMA, SATA, UFS, SD/MMC, CF and HSI. Target-side blocks: two DDR3-2133 DRAM channels, on-die SRAM, on-die ROM, IP control and peripherals. Network elements are labeled A to J with sizes from 1x3 to 5x2; link widths range from 32 to 128 bits and clocks from 133 MHz to 1333 MHz.]

Reducing power consumption
- Engineers have developed many power saving techniques
  - Reduce the clock frequency when possible
  - Stop the clock if there is nothing useful to be done
  - Reduce the voltage when possible (P = C × V² × f)
  - Remove (switch) the voltage in many cases
- Develop islands of (frequency, voltage, switched power)
  - Part of the SoC may need to be running full-speed
  - While other portions can be slowed, stopped, or switched off
- How do these techniques affect the creation and use of IP cores?
- How do these techniques affect the SoC infrastructure?