Asynchronous on-chip Communication: Explorations on the Intel PXA27x Peripheral Bus

Asynchronous on-chip Communication: Explorations on the Intel PXA27x Peripheral Bus
Andrew M. Scott, Mark E. Schuelein, Marly Roncken, Jin-Jer Hwan, John Bainbridge, John R. Mawer, David L. Jackson, Andrew Bardsley
Slide 1

Introduction
Slide 2

Intel PXA27x Processor Design
[Block diagram: the Intel XScale core, Debug Controller, Intel Wireless MMX, Internal SRAM, LCD Controller, Quick Capture Interface, USB Host Controller, Power Management / Clock Control, and Memory Controller (dynamic/static memory control, PC Card/CompactFlash control, variable-latency I/O control) sit on the System Bus (104/133/208 MHz). A DMA Controller & Bridge connects the System Bus to the Peripheral Bus (PB, 13/26 MHz), which serves GPIO, RTC, OS Timers, 4x PWM, Interrupt, 3x SSP, USIM, I2S, AC 97, Std/Full/Bluetooth UART, Fast Infrared, I2C, USB Client, BB Interface, Keypad Interface, MMC/SD/SDIO, Memory Stick, and USB On-The-Go, each with its own interface clock ranging from 32 kHz to about 104 MHz.]
Slide 3

Intel PXA27x Processor Design
[Same block diagram as Slide 3.]
Slide 4

Async Peripheral Bus Team
[Team roles overlaid on the PXA27x block diagram]
SoC Flow Development: Andy & Jin-Jer
Extreme Low-Power Product Design: Mark & Mark Fullerton
Asynchronous Tools: Marly & Andrew
Asynchronous Fabrics: John, John & Dave
Slide 5

Objectives
Build an Asynchronous NoC in a Synchronous SoC flow
Assess design tradeoffs from a Product Developer's perspective
Identify gaps in current design capabilities
Slide 6

What do SoC Developers worry about?
Inflexible product-introduction cycles
Shorter product lead times & product lifetimes
Growing complexity: design & manufacturing rules; product & system design
Slide 7

What do SoC Developers want from their flow?
Fast integration and validation of IP
Minimal IP and IP-collateral redesign
Modular design flows
Minimal disruption to their synchronous SoC flow
Slide 8

Exploration - What we did
Slide 9

Peripheral Bus Baseline Design
[Peripheral Bus block diagram]
Baseline: the Peripheral Bus fabric, its Bus Master (the DMA Controller & Bridge), and 3 representative Bus Slaves: SSP, UART, Baseband (BB) Interface
Slide 10

Peripheral Bus Synchronous Interface
[Diagram: the DMAC, SSP, Std UART, and BB Interface each live in their own clock domain; synchronizers bridge each peripheral clock domain to the PB clock domain.]
Slide 11

Peripheral Bus Asynchronous Interface
3 Async Interface Adaptations:
Synchronizing Adapter - no Master/Slave redesign; simplest; adds synchronizers (see the sketch below)
Pausible Clock Adapter - no Master/Slave redesign; locally generated interface clock; no extra synchronizers
Asynchronous Interface - requires redesign (UART); removes synchronizers to the PB
[Diagram: DMAC, SSP, Std UART, and BB Interface attach to the asynchronous fabric through Synchronizing, Pausible Clock, and Asynchronous interface adaptations.]
Slide 12
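For a concrete picture of where the synchronizing adapter's cost comes from, the following is a minimal, cycle-level Python sketch of a classic two-flop synchronizer (an illustration from us, not the actual PXA27x adapter RTL): a signal crossing into the PB clock domain only becomes visible about two destination-clock edges after it is raised, which is the kind of overhead the latency slides later attribute to command and response synchronization.

```python
# Minimal cycle-level model of a two-flop synchronizer (illustrative sketch,
# not the PXA27x adapter design). An asynchronous request is sampled by two
# flip-flops clocked in the destination (PB) domain, so downstream logic sees
# it roughly two destination-clock edges after it is asserted.

class TwoFlopSynchronizer:
    def __init__(self):
        self.ff1 = 0  # first stage (may be metastable in real hardware)
        self.ff2 = 0  # second stage (assumed settled)

    def tick(self, async_in: int) -> int:
        """Advance one destination-clock edge; return the synchronized output."""
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2


if __name__ == "__main__":
    sync = TwoFlopSynchronizer()
    req = [0, 1, 1, 1, 1, 1]            # request raised at cycle 1
    seen = [sync.tick(r) for r in req]
    print(seen)  # [0, 0, 1, 1, 1, 1]: the request propagates through both
                 # flops and is usable in the PB domain ~2 edges later
```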

Transaction Level Testing
Slide 13

Transaction-Level Testing (TLT)
Test scope: functional coverage; stress & error conditions; multiple use models & traffic scenarios; peripheral, subsystem, and system level
How? Specify transactions; automatic protocol adherence and results checking (see the sketch below)
Strengths: test re-use at peripheral, subsystem & system level; test re-use for synchronous & asynchronous; facilitated abstraction to higher-level traffic patterns
AND HENCE: highly portable & powerful!
Slide 14
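The slides do not show the test environment itself, so the following Python sketch is only a hypothetical illustration of the TLT idea: tests are written as transactions against a reference model, and a checker enforces correct results regardless of whether the fabric under test is synchronous or asynchronous. All names here (Transaction, run_tlt, PBModel) are ours, not from the presentation.

```python
# Hypothetical sketch of a transaction-level test: the same transaction
# stream can drive a synchronous or an asynchronous PB model, and the
# checker compares observed results against a simple reference memory.

from dataclasses import dataclass

@dataclass
class Transaction:
    write: bool
    addr: int
    data: int = 0               # payload for writes; unused for reads

def run_tlt(bus, transactions):
    """Apply transactions to a bus model and check every read against a reference."""
    reference = {}                       # golden model of peripheral registers
    for t in transactions:
        if t.write:
            bus.write(t.addr, t.data)    # bus model handles protocol details
            reference[t.addr] = t.data
        else:
            observed = bus.read(t.addr)
            expected = reference.get(t.addr, 0)
            assert observed == expected, (
                f"Mismatch at {t.addr:#x}: got {observed:#x}, expected {expected:#x}")

# Example use (assuming some PBModel exposing read()/write()):
#   run_tlt(PBModel(), [Transaction(True, 0x4010_0000, 0xCAFE),
#                       Transaction(False, 0x4010_0000)])
```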

EDA Flow & Network Construction
Slide 15

Scope
[Flow diagram: Silistix Design Entry (async network IP, text) -> Async Network Generation (Silistix tools); Intel Design Entry (Intel PXA27x IP, text); Synthesis & Netlist Integration (Intel stdcell/SRAM libraries); RTL & Gate-level Validation (functionality, latency, throughput); Gate-level Power Simulation and Metric (Area) Analysis (Intel stdcell/SRAM models); Intel & Formal Verification to validate PB flow scripting.]
Build 5 representative synchronous & asynchronous top-level networks
Evaluate functionality, timing & realistic power metrics
Assess asynchronous design & EDA-flow integration issues
Slide 16

Silistix Design Entry & Asynchronous Network Construction
[Same flow diagram as Slide 16]
Enter a high-level description of the self-timed NoC topology
Generate hierarchical, structural Verilog netlists
Modify the UART's PB-facing logic to attach directly to the asynchronous fabric
Slide 17

Intel Design Entry & Network Construction
[Same flow diagram as Slide 16]
Typical low-power SoC flow: commercial EDA tools, used across a wide product & process range (180, 130, 90 nm, etc.)
Slide 18

Intel Design Entry & Network Construction
[Same flow diagram as Slide 16]
Our Usage Model:
Synthesize synchronous blocks at 2 PVT corners
1M-gate wire-load model to match the original 27-peripheral PB
No clock-gating or scan insertion
Import asynchronous blocks
Stitch top-level networks
Slide 19

Evaluation
[Same flow diagram as Slide 16]
Validate SoC flow usage
Gate-to-gate formal verification (FV) for key synchronous blocks
Slide 20

Evaluation
[Same flow diagram as Slide 16]
Dynamic simulation of functionality, timing & power
Unit-delay models; back-annotation at 2 PVT corners
Typical PB traffic scenarios (modeled in the sketch below): 0.5 MB/s (PB idle), 1 MB/s (PB normal), 10 MB/s (PB max)
Netlist-based metric collection
Slide 21
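To give a feel for what those traffic scenarios mean on the bus, here is a small illustrative calculation (ours, not from the slides) converting each rate into an average transaction spacing; the 26 MHz PB clock and 4-byte transfers are assumptions made only for this example.

```python
# Illustrative conversion of the three PB traffic scenarios into average
# transaction spacing. The 26 MHz PB clock and 32-bit (4-byte) transfers
# are assumptions made for the sake of the example.

PB_CLOCK_HZ = 26_000_000
BYTES_PER_TRANSFER = 4

def cycles_between_transfers(rate_mb_per_s: float) -> float:
    transfers_per_s = rate_mb_per_s * 1_000_000 / BYTES_PER_TRANSFER
    return PB_CLOCK_HZ / transfers_per_s

for label, rate in [("PB idle", 0.5), ("PB normal", 1.0), ("PB max", 10.0)]:
    print(f"{label:9s} {rate:5.1f} MB/s -> one transfer every "
          f"{cycles_between_transfers(rate):6.1f} PB cycles")
```

Under these assumptions, even the 10 MB/s "PB max" case leaves only about ten PB cycles per transfer, which hints at why adapter latency becomes the bandwidth limiter in the later slides.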

Top-Level Networks
Small test cases for debug: sync_1i3t, async_1i3t - 1x (UART-sync, BB, SSP) and 1x (UART-async, BB, SSP)
Primary test cases for async-sync comparisons:
async_1i27t - 9x (UART-async, BB, SSP); UART-sync + Synchronizing Adapter substituted for metrics
sync_1i27t - 9x (UART-sync, BB, SSP) + up-scaled PB_MUX
Extra test case to check scaling properties: async_1i30t - 10x (UART-async, BB, SSP)
Slide 22

Results
Slide 23

Active Power x Traffic: Async Fabric
[Bar chart of async fabric active power (µW) vs. traffic; chart annotation: 94%]

Traffic   | async_1i3t | async_1i27t | async_1i30t
0.5 MB/s  | 6          | 11          | 11
1 MB/s    | 13         | 26          | 26
10 MB/s   | 101        | 197         | 197

Async fabric power SCALES with traffic
Slide 24
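A quick way to read the chart's "scales with traffic" message is to divide each reported power by its traffic rate; the short snippet below does exactly that with the Slide 24 numbers and shows the fabric burns roughly a constant 10-26 µW per MB/s rather than a large traffic-independent floor.

```python
# Reworking the Slide 24 numbers: async fabric active power (µW) per traffic
# scenario, and power per MB/s, to show that fabric power tracks traffic.

power_uw = {            # values read off the Slide 24 chart
    "async_1i3t":  {0.5: 6,  1.0: 13, 10.0: 101},
    "async_1i27t": {0.5: 11, 1.0: 26, 10.0: 197},
    "async_1i30t": {0.5: 11, 1.0: 26, 10.0: 197},
}

for network, by_rate in power_uw.items():
    per_mb = {rate: round(p / rate, 1) for rate, p in by_rate.items()}
    print(network, per_mb, "µW per (MB/s)")
```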

Active Power x Traffic: UART
[Bar chart of UART active power (µW) vs. traffic]

Traffic   | UART-sync | UART-async | Reduction
0.5 MB/s  | 848       | 242        | 71%
1 MB/s    | 855       | 253        | 70%
10 MB/s   | 892       | 248        | 72%

~70% lower power for the async redesign
NOTE: async power scaling is not visible for the given TLT
Slide 25

Active Power x Data: async_1i27t
[Bar chart of active power (mW) by component vs. traffic]

Traffic   | Total | PB   | UART | SSP  | BB
0.5 MB/s  | 5.15  | 0.46 | 0.24 | 0.84 | 3.61
1 MB/s    | 5.25  | 0.53 | 0.25 | 0.84 | 3.63
10 MB/s   | 6.44  | 1.37 | 0.24 | 0.84 | 3.99

Synchronous peripherals dominate the power spectrum
REASON: a small piece of asynchronous in a BIG synchronous world MEANS frequent interfacing AND HENCE a smaller up-scaling of the advantages
Slide 26

Metrics: Full Top-Level PB System
[Bar chart: ratio to synchronous]

Network     | Cells | Gates (NAND2) | Raw Cell Area
sync_1i27t  | 1.00  | 1.00          | 1.00
async_1i27t | 1.26  | 1.17          | 1.15
Increase    | 26%   | 17%           | 15%

Adapter overhead is small at the PB system level: ~15% raw area... and adding in WIRES: 66% fewer wires, which should result in better routing flexibility and better layout density
Slide 27

Interface Adaptation Metrics
[Bar chart: ratio to the Asynchronous Interface]

Adaptation     | Cells | Gates | Raw Area | Latency
Asynchronous   | 1.0   | 1.0   | 1.0      | 0
Synchronizing  | 2.4   | 3.4   | 3.2      | 4
Pausible Clock | 2.7   | 5.0   | 4.6      | 2

All 3 adaptation schemes worked!
KEY learning is HERE...
Slide 28

Latency and Bandwidth (Slides 29-36, incremental build)
PB had no latency requirement, but every transfer was 2 cycles, with no transfer overlapping or pipelining - all latency directly limits bandwidth.
[Diagram, built up across the slides: DMA/bridge <-> Network Gateway <-> asynchronous network <-> Network Adapter <-> Client, with command and response paths annotated (1 cmd transfer, 1 cmd setup, 2 cmd synchronization, 2 protocol clocking, 1 rsp transfer, 2 rsp synchronization); the legend distinguishes self-timed time intervals from clocked time intervals (one cycle).]
PB bus protocol requires 2 cycles per transfer
Synchronizing adapter ~ 9 cycles: latency limits bandwidth - only 90% of target
Pausible clock adapter ~ 7 cycles: removes 2 cycles of command synchronization
Async peripheral interface ~ 5 cycles: removes ~2 cycles of protocol clocking overhead
Logic optimization ~ 4 cycles: bus arbitration cycle unnecessary for the PB protocol
Async bridge ~ 2 cycles: removes 2 cycles of response synchronization
Future: concurrent command & response ~ 1 cycle: 200% (2x improvement) of target
(A small worked calculation of how these cycle counts translate into relative bandwidth follows below.)
Slides 29-36
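To make "latency directly limits bandwidth" concrete, here is a small illustrative Python calculation (ours, not from the slides): with no transfer overlap, each transfer occupies the bus for its full latency, so achievable bandwidth is inversely proportional to cycles per transfer. The cycle counts are those quoted on the slides; the slides' "% of target" figures refer to the PB's own bandwidth target and are not reproduced here, so the output only shows the relative trend against the 2-cycle synchronous PB protocol.

```python
# With no overlapping or pipelining of transfers, each transfer occupies the
# bus for its whole latency, so bandwidth ~ 1 / (cycles per transfer).
# Cycle counts below are the ones quoted on Slides 29-36.

BASELINE_CYCLES = 2          # synchronous PB protocol: 2 cycles per transfer

options = {
    "Synchronizing adapter":         9,
    "Pausible clock adapter":        7,
    "Async peripheral interface":    5,
    "Logic optimization":            4,
    "Async bridge":                  2,
    "Concurrent cmd & rsp (future)": 1,
}

for name, cycles in options.items():
    relative_bw = BASELINE_CYCLES / cycles   # bandwidth relative to the 2-cycle PB
    print(f"{name:31s} {cycles:2d} cycles -> "
          f"{relative_bw:4.0%} of the 2-cycle PB baseline")
```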

Key Learnings & Future Directions
Partition with the NoC in mind:
Minimize the number of timing-domain crossings
Partition between NoC, peripherals & interface logic
Encapsulate asynchronous NoCs to simplify integration with mostly-synchronous tools
Take advantage of NoC strengths:
Exploit the layered communication approach
Concurrency can dramatically improve throughput & latency
Lower IP generation & validation costs
A self-timed NoC promotes faster timing closure and lower standby power
Employ transaction-level test suites:
They were invaluable in testing, debugging, and benchmarking our NoCs
They enable portable, maintainable, flexible validation suites, re-usable at multiple levels of abstraction
Real SoC traffic isn't homogeneous, and is much easier to model in a flexible, modular TLT
Slide 37

Key Learnings & Future Directions
It's still a mostly-synchronous SoC world:
New methods must seamlessly integrate with mostly-synchronous flows
Static timing analysis flows & engines need to be enhanced to better handle complex multi-frequency and asynchronous design content (see the SRC investigation by Beerel/Stevens)
SoC developers want flexibility in choosing power, latency, bandwidth and area:
Our four-phase 1-hot QDI style was very robust, but a limiting factor in power reduction and achievable bandwidth
We see potential benefits in two-phase, single-rail and alternate QDI encodings
We expect that additional asynchronous cells and single-rail FIFOs will enable further improvements
Slide 38

Summary
We built an asynchronous NoC in a synchronous SoC flow, today
We demonstrated asynchronous NoC advantages
We explored a number of tradeoffs
We learned lessons & identified areas for further development
Slide 39

Asynchronous NoC in SoC. Do it. Slide 40