Asynchronous on-chip Communication: Explorations on the Intel PXA27x Peripheral Bus

Andrew M. Scott, Mark E. Schuelein, Marly Roncken, Jin-Jer Hwan, John Bainbridge, John R. Mawer, David L. Jackson, Andrew Bardsley
Slide 1

Introduction
Slide 2

Intel PXA27x Processor Design
[Block diagram. Peripherals on the Peripheral Bus (PB, 13 MHz / 26 MHz): General Purpose I/O (GPIO), RTC, OS Timers, 4x PWM, Interrupt, 3x SSP, USIM, I2S, AC97, Std UART, Full UART, Bluetooth UART, Fast Infrared, I2C, USB Client, BB Interface, Keypad Interface, MMC/SD/SDIO, Memory Stick, USB On-The-Go. Peripheral clocks range from about 32 kHz to 104 MHz. The DMA Controller & Bridge links the PB to the System Bus (104/133/208 MHz). Other blocks shown: Intel XScale Core, Intel Wireless MMX, Quick Capture Interface, Power Management / Clock Control, Internal SRAM, LCD Controller, Debug Controller, USB Host Controller, 32.768 kHz and 13 MHz oscillators, and the Memory Controller (address & data, variable latency I/O control, PC Card / CompactFlash control, dynamic memory control, static memory control).]
Slide 3

Async Peripheral Bus Team
[Intel PXA27x block diagram as on Slide 3, annotated with team roles.]
- SoC Flow Development: Andy & Jin-Jer
- Extreme Low-Power Product Design: Mark & Mark Fullerton
- Asynchronous Tools: Marly & Andrew
- Asynchronous Fabrics: John, John & Dave
Slide 5

Objectives
- Build an asynchronous NoC in a synchronous SoC flow
- Assess design tradeoffs from the product developer's perspective
- Identify gaps in current design capabilities
Slide 6

What do SoC Developers worry about?
- Inflexible product-introduction cycles
- Shorter product lead times & product lifetimes
- Growing complexity
  - Design & manufacturing rules
  - Product & system design
Slide 7

What do SoC Developers want from their flow?
- Fast integration and validation of IP
- Minimal IP and IP-collateral redesign
- Modular design flows
- Minimal disruption to their synchronous SoC flow
Slide 8

Exploration - What we did
Slide 9

Peripheral Bus Baseline Design
[Block diagram of the PXA27x Peripheral Bus (PB, 13 MHz / 26 MHz), its peripherals, and the DMA Controller & Bridge.]
The baseline design comprises:
- 3 representative bus slaves: SSP, UART, Baseband (BB)
- Peripheral Bus fabric
- Bus master: DMA Controller
Slide 10

Peripheral Bus Synchronous Interface
[Diagram: SSP, Std UART, BB Interface, and DMAC attached to the PB. Legend: PB clock domain; SSP, UART, BB, and DMAC clock domains; synchronizer.]
Slide 11

Peripheral Bus Asynchronous Interface
[Diagram: SSP, Std UART, BB Interface, and DMAC attached to the asynchronous fabric through interfaces labelled Synchronizing, Pausible Clock, and Asynchronous.]
3 async interface adaptations:
- Synchronizing Adapter
  - no Master/Slave redesign
  - simplest, adds synchronizers
- Pausible Clock Adapter
  - no Master/Slave redesign
  - locally generated interface clock
  - no extra synchronizers
- Asynchronous Interface
  - requires redesign (UART)
  - removes synchronizers to PB
Slide 12

Transaction Level Testing
Slide 13

Transaction-Level Testing (TLT)
- Test scope
  - Functional coverage
  - Stress & error conditions
  - Multiple use models & traffic scenarios
  - Peripheral, subsystem, and system level
- How?
  - Specify transactions
  - Automatic protocol adherence and results checking
- Strengths
  - Test re-use at peripheral, subsystem & system level
  - Test re-use for synchronous & asynchronous
  - Facilitated abstraction to higher-level traffic patterns
- AND HENCE: highly portable & powerful!
Slide 14
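The idea of specifying transactions once and re-using them for synchronous and asynchronous designs can be sketched as follows. This is a minimal illustration, not the team's test environment: `Transaction`, `SyncBusStub`, `AsyncBusStub`, and `run_suite` are hypothetical names, and the two stubs are toy register models standing in for real bus simulations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Transaction:
    """One bus transaction; for reads, `data` is the expected value."""
    kind: str                 # "write" or "read"
    addr: int
    data: Optional[int] = None


class SyncBusStub:
    """Hypothetical stand-in for a clocked PB model (2 cycles per transfer)."""

    def __init__(self) -> None:
        self.regs: Dict[int, int] = {}
        self.cycles = 0

    def write(self, addr: int, data: int) -> None:
        self.regs[addr] = data
        self.cycles += 2

    def read(self, addr: int) -> int:
        self.cycles += 2
        return self.regs.get(addr, 0)


class AsyncBusStub:
    """Hypothetical stand-in for a self-timed fabric model (counts handshakes)."""

    def __init__(self) -> None:
        self.regs: Dict[int, int] = {}
        self.handshakes = 0

    def write(self, addr: int, data: int) -> None:
        self.regs[addr] = data
        self.handshakes += 1

    def read(self, addr: int) -> int:
        self.handshakes += 1
        return self.regs.get(addr, 0)


def run_suite(bus, suite: List[Transaction]) -> List[str]:
    """Replay the transactions and return a list of mismatch messages."""
    errors: List[str] = []
    for index, txn in enumerate(suite):
        if txn.kind == "write":
            bus.write(txn.addr, txn.data)
        elif txn.kind == "read":
            got = bus.read(txn.addr)
            if got != txn.data:
                errors.append(f"txn {index}: read {got:#x}, expected {txn.data:#x}")
        else:
            errors.append(f"txn {index}: unknown kind {txn.kind!r}")
    return errors


# The same suite is replayed, unchanged, against both models.
SUITE = [
    Transaction("write", 0x10, 0xAB),
    Transaction("read", 0x10, 0xAB),
    Transaction("read", 0x14, 0x00),  # untouched register reads back as 0
]

sync_errors = run_suite(SyncBusStub(), SUITE)
async_errors = run_suite(AsyncBusStub(), SUITE)
print(sync_errors, async_errors)
```

Because the suite only talks to `write` and `read`, the timing style of the model underneath does not leak into the tests, which is the portability the slide points to.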

EDA Flow & Network Construction
Slide 15

Scope
[Flow diagram. Stages: Silístix Design Entry (async network IP, text); Async Network Generation (Silístix tools); Intel Design Entry (Intel PXA27x IP, text); Synthesis & Netlist Integration (Intel Stdcell / SRAM libraries); RTL & Gate-level Validation (functionality, latency, throughput); Gate-level Power Simulation and Metric (Area) Analysis (Intel Stdcell / SRAM models); Formal Verification (to validate PB flow scripting).]
- Build 5 representative synchronous & asynchronous top-level networks
- Evaluate functionality, timing & realistic power, and metrics
- Assess asynchronous design & EDA flow integration issues
Slide 16

Silístix Design Entry & Asynchronous Network Construction
[Flow diagram as on Slide 16.]
- Enter high-level description of self-timed NoC topology
- Generate hierarchical, structural Verilog netlists
- Modify UART PB-facing logic to attach directly to the asynchronous fabric
Slide 17

Intel Design Entry & Network Construction
[Flow diagram as on Slide 16.]
- Typical low-power SoC flow
  - uses commercial EDA tools
  - used for wide product & process range (180, 130, 90 nm etc.)
Slide 18

Intel Design Entry & Network Construction
[Flow diagram as on Slide 16.]
Our usage model:
- Synthesize synchronous blocks at 2 PVT corners
  - 1M-gate wire-load model to match original 27-peripheral PB
  - No clock gating or scan insertion
- Import asynchronous blocks
- Stitch top-level networks
Slide 19

Evaluation
[Flow diagram as on Slide 16.]
- Validate SoC flow usage
  - Gate-to-gate formal verification (FV)
  - For key synchronous blocks
Slide 20

Evaluation
[Flow diagram as on Slide 16.]
- Dynamic simulation
  - Functionality, timing & power
  - Unit-delay models
  - Back-annotation, 2 PVT corners
- Typical PB traffic scenarios
  - 0.5 MB/s (PB idle)
  - 1 MB/s (PB normal)
  - 10 MB/s (PB max)
- Netlist-based metric collection
Slide 21

Top-Level Networks
- Small test cases for debug:
  - sync_1i3t: 1x (UART-sync, BB, SSP)
  - async_1i3t: 1x (UART-async, BB, SSP)
- Primary test cases for async-sync comparisons:
  - async_1i27t: 9x (UART-async, BB, SSP); UART-sync + Synchronizing Adapter substituted for metrics
  - sync_1i27t: 9x (UART-sync, BB, SSP) + up-scaled PB_MUX
- Extra test case to check scaling properties:
  - async_1i30t: 10x (UART-async, BB, SSP)
Slide 22
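The network names appear to encode the topology. Reading `1i27t` as "1 initiator, 27 targets" is an assumption (the slides do not spell it out), but it matches one DMA bus master and replicated groups of three slaves. The sketch below parses the names under that assumption and checks them against the replication counts on the slide; `parse_name` and `NETWORKS` are hypothetical helpers.

```python
import re

SLAVES_PER_GROUP = 3  # each replicated group is (UART, BB, SSP)

# Network name -> number of replicated slave groups, as listed on the slide.
NETWORKS = {
    "sync_1i3t": 1,
    "async_1i3t": 1,
    "async_1i27t": 9,
    "sync_1i27t": 9,
    "async_1i30t": 10,
}


def parse_name(name):
    """Split '<style>_<N>i<M>t' into (style, initiators, targets)."""
    match = re.fullmatch(r"(sync|async)_(\d+)i(\d+)t", name)
    if match is None:
        raise ValueError(f"unrecognized network name: {name}")
    return match.group(1), int(match.group(2)), int(match.group(3))


consistent = {}
for name, groups in NETWORKS.items():
    style, initiators, targets = parse_name(name)
    # Assumed reading: target count = groups x 3 slaves per group.
    consistent[name] = (targets == groups * SLAVES_PER_GROUP)
    print(f"{name}: {style}, {initiators} initiator(s), {targets} target(s)")
```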

Results
Slide 23

Active Power x Traffic: Async Fabric
Active power (µW)
Traffic     async_1i3t  async_1i27t  async_1i30t
0.5 MB/s             6           11           11
1 MB/s              13           26           26
10 MB/s            101          197          197
Chart annotation: 94%
Async fabric power SCALES with traffic
Slide 24
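One way to read the 94% annotation is as the drop in fabric power between the 10 MB/s and 0.5 MB/s scenarios. That interpretation is an assumption, not something the slide states; the sketch below only shows that the charted values are consistent with it. `FABRIC_POWER_UW` and `idle_saving_percent` are hypothetical names.

```python
# Async fabric active power in microwatts, keyed by traffic in MB/s.
FABRIC_POWER_UW = {
    "async_1i3t": {0.5: 6, 1: 13, 10: 101},
    "async_1i27t": {0.5: 11, 1: 26, 10: 197},
    "async_1i30t": {0.5: 11, 1: 26, 10: 197},
}


def idle_saving_percent(power_by_traffic):
    """Percent less power at 0.5 MB/s (PB idle) than at 10 MB/s (PB max)."""
    idle = power_by_traffic[0.5]
    peak = power_by_traffic[10]
    return 100.0 * (1.0 - idle / peak)


savings = {name: idle_saving_percent(row) for name, row in FABRIC_POWER_UW.items()}
for name, value in savings.items():
    print(f"{name}: {value:.1f}% lower at idle than at max traffic")
```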

Active Power x Traffic: UART
Active power (µW)
             0.5 MB/s  1 MB/s  10 MB/s
UART-sync         848     855      892
UART-async        242     253      248
Reduction         71%     70%      72%
70% lower power for async redesign
NOTE: Async power scaling not visible for given TLT
Slide 25
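The reduction row can be reproduced from the two power rows. The sketch below is a simple check of that arithmetic using the slide's values; the names `UART_POWER_UW` and `reduction_percent` are illustrative.

```python
# (UART-sync, UART-async) active power in microwatts per traffic scenario.
UART_POWER_UW = {
    "0.5 MB/s": (848, 242),
    "1 MB/s": (855, 253),
    "10 MB/s": (892, 248),
}


def reduction_percent(sync_uw, async_uw):
    """Power saved by the async redesign, as a percentage of the sync power."""
    return 100.0 * (sync_uw - async_uw) / sync_uw


reductions = {
    traffic: round(reduction_percent(sync_uw, async_uw))
    for traffic, (sync_uw, async_uw) in UART_POWER_UW.items()
}
print(reductions)
```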

Active Power x Data: async_1i27t
Active power (mW)
Traffic     Total    PB  UART   SSP    BB
0.5 MB/s     5.15  0.46  0.24  0.84  3.61
1 MB/s       5.25  0.53  0.25  0.84  3.63
10 MB/s      6.44  1.37  0.24  0.84  3.99
Synchronous peripherals dominate the power spectrum
REASON: a small piece of asynchronous in a BIG synchronous world MEANS frequent interfacing AND HENCE smaller up-scale of advantages
Slide 26
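To see how the total splits across columns, the sketch below checks that the four component columns add up to the Total column and computes each column's share. It uses only the slide's numbers; `POWER_MW` and `shares` are hypothetical names.

```python
# Active power in milliwatts for async_1i27t, per traffic scenario.
POWER_MW = {
    "0.5 MB/s": {"Total": 5.15, "PB": 0.46, "UART": 0.24, "SSP": 0.84, "BB": 3.61},
    "1 MB/s": {"Total": 5.25, "PB": 0.53, "UART": 0.25, "SSP": 0.84, "BB": 3.63},
    "10 MB/s": {"Total": 6.44, "PB": 1.37, "UART": 0.24, "SSP": 0.84, "BB": 3.99},
}


def shares(row):
    """Percentage of the total contributed by each non-total column."""
    total = row["Total"]
    return {name: 100.0 * value / total for name, value in row.items() if name != "Total"}


def component_sum(row):
    """Sum of the component columns, for comparison with the Total column."""
    return sum(value for name, value in row.items() if name != "Total")


for traffic, row in POWER_MW.items():
    pretty = ", ".join(f"{name} {pct:.0f}%" for name, pct in shares(row).items())
    print(f"{traffic}: {pretty}")
```

The BB column stays at roughly 60-70% of the total in every scenario, while the PB column grows from about 9% to about 21% as traffic rises.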

Metrics: Full Top-Level PB System
Ratio to synchronous
              Cells  Gates (NAND2)  Raw cell area
sync_1i27t     1.00           1.00           1.00
async_1i27t    1.26           1.17           1.15
Increase        26%            17%            15%
- Adapter overhead is small at the PB system level
- ~15% raw area... add in WIRES: 66% fewer wires, which should result in:
  - better routing flexibility
  - better layout density
Slide 27

Interface Adaptation Metrics
Ratio to asynchronous interface
                Cells  Gates  Raw area  Latency
Asynchronous      1.0    1.0       1.0        0
Synchronizing     2.4    3.4       3.2        4
Pausible Clock    2.7    5.0       4.6        2
All 3 adaptation schemes worked!
KEY learning is HERE...
Slide 28

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
Slide 29

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client and DMA/bridge. Intervals: 1 cmd transfer, 1 rsp transfer. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
Slide 30

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client, two Network Adapter/Gateway blocks, DMA/bridge. Intervals: 1 cmd setup, 2 cmd synchronization, 1 cmd transfer, 2 protocol clocking, 1 rsp transfer, 2 rsp synchronization. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
- Synchronizing adapter ~ 9 cycles
  - Latency limits bandwidth - only 90% of target
Slide 31

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client, two Network Adapter/Gateway blocks, DMA/bridge. Intervals: 1 cmd setup, 1 cmd transfer, 2 protocol clocking, 1 rsp transfer, 2 rsp synchronization. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
- Synchronizing adapter ~ 9 cycles
  - Latency limits bandwidth - only 90% of target
- Pausible clock adapter ~ 7 cycles
  - Removes 2 cycles of command synchronization
Slide 32

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client, two Network Adapter/Gateway blocks, DMA/bridge. Intervals: 1 cmd setup, 1 cmd transfer, 1 rsp transfer, 2 rsp synchronization. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
- Synchronizing adapter ~ 9 cycles
  - Latency limits bandwidth - only 90% of target
- Pausible clock adapter ~ 7 cycles
  - Removes 2 cycles of command synchronization
- Async peripheral interface ~ 5 cycles
  - Removes ~2 cycles protocol clocking overhead
Slide 33

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client, two Network Adapter/Gateway blocks, DMA/bridge. Intervals: 1 cmd transfer, 1 rsp transfer, 2 rsp synchronization. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
- Synchronizing adapter ~ 9 cycles
  - Latency limits bandwidth - only 90% of target
- Pausible clock adapter ~ 7 cycles
  - Removes 2 cycles of command synchronization
- Async peripheral interface ~ 5 cycles
  - Removes ~2 cycles protocol clocking overhead
- Logic optimization ~ 4 cycles
  - Bus arbitration cycle unnecessary for PB protocol
Slide 34

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client, two Network Adapter/Gateway blocks, DMA/bridge. Intervals: 1 cmd transfer, 1 rsp transfer. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
- Synchronizing adapter ~ 9 cycles
  - Latency limits bandwidth - only 90% of target
- Pausible clock adapter ~ 7 cycles
  - Removes 2 cycles of command synchronization
- Async peripheral interface ~ 5 cycles
  - Removes ~2 cycles protocol clocking overhead
- Logic optimization ~ 4 cycles
  - Bus arbitration cycle unnecessary for PB protocol
- Async bridge ~ 2 cycles
  - Removes 2 cycles of response synchronization
Slide 35

Latency and bandwidth
PB had no latency requirement, but every transfer was 2 cycles. With no transfer overlapping or pipelining, all latency directly limits bandwidth.
[Timing diagram: Client, two Network Adapter/Gateway blocks, DMA/bridge. Intervals: 1 cmd transfer AND 1 rsp transfer, concurrent. Legend: self-timed time interval; clocked time interval (one cycle).]
- PB bus protocol requires 2 cycles per transfer
- Synchronizing adapter ~ 9 cycles
  - Latency limits bandwidth - only 90% of target
- Pausible clock adapter ~ 7 cycles
  - Removes 2 cycles of command synchronization
- Async peripheral interface ~ 5 cycles
  - Removes ~2 cycles protocol clocking overhead
- Logic optimization ~ 4 cycles
  - Bus arbitration cycle unnecessary for PB protocol
- Async bridge ~ 2 cycles
  - Removes 2 cycles of response synchronization
- Future: concurrent command & response ~ 1 cycle
  - 200% (2x improvement) of target
Slide 36
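The latency budget across these build slides can be tallied step by step. The sketch below uses the approximate cycle counts from the slides and derives the saving at each step; it also checks the final 200% figure under the assumption that bandwidth relative to target is simply target cycles divided by achieved cycles. That simple ratio is an assumption and is applied only to the last step; the 90% figure quoted for the synchronizing adapter is not derived here. `STEPS`, `savings`, and `TARGET_CYCLES` are illustrative names.

```python
TARGET_CYCLES = 2  # PB bus protocol: 2 cycles per transfer

# (design step, approximate latency in cycles), in the order presented.
STEPS = [
    ("Synchronizing adapter", 9),
    ("Pausible clock adapter", 7),
    ("Async peripheral interface", 5),
    ("Logic optimization", 4),
    ("Async bridge", 2),
    ("Concurrent command & response (future)", 1),
]


def savings(steps):
    """Cycles removed by each step relative to the step before it."""
    return [
        (name, previous - current)
        for (_, previous), (name, current) in zip(steps, steps[1:])
    ]


step_savings = savings(STEPS)
for name, saved in step_savings:
    print(f"{name}: removes {saved} cycle(s)")

# Assumed model: percent of target = target cycles / achieved cycles.
final_percent_of_target = 100.0 * TARGET_CYCLES / STEPS[-1][1]
print(f"Final step: {final_percent_of_target:.0f}% of target")
```

The per-step savings (2, 2, 1, 2, 1) match the reasons given on the slides: command synchronization, protocol clocking, the bus arbitration cycle, response synchronization, and overlapping command with response.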

Key Learnings & Future Directions
- Partition with NoC in mind
  - Minimize the number of timing domain crossings
  - Partition between NoC, peripherals & interface logic
  - Encapsulate asynchronous NoCs to simplify integration with mostly-synchronous tools
- Take advantage of NoC strengths!
  - Exploit the layered communication approach
  - Concurrency can dramatically improve throughput & latency
  - Lower IP generation & validation costs
  - Self-timed NoC promotes faster timing closure and lower standby power
- Employ Transaction Level Test Suites
  - They were invaluable in testing, debugging, and benchmarking our NoCs
  - They enable portable, maintainable, flexible validation suites re-usable at multiple levels of abstraction
  - Real SoC traffic isn't homogeneous, and is much easier to model in a flexible, modular TLT
Slide 37

Key Learnings & Future Directions
- It's still a mostly-synchronous SoC world
  - New methods must seamlessly integrate with mostly-synchronous flows
  - Static timing analysis flows & engines need to be enhanced to better handle complex multi-frequency and asynchronous design content (see SRC investigation by Beerel/Stevens)
- SoC developers want flexibility in choosing power, latency, bandwidth and area
  - Our four-phase 1-hot QDI style was very robust, but a limiting factor in power reduction and achievable bandwidth
  - We see potential benefits in two-phase, single-rail and alternate QDI encodings
  - We expect that additional asynchronous cells and single-rail FIFOs will enable further improvements
Slide 38

Summary
- We built an asynchronous NoC in a synchronous SoC flow, today
- We demonstrated asynchronous NoC advantages
- We explored a number of tradeoffs
- We learned lessons & identified areas for further development
Slide 39

Asynchronous NoC in SoC. Do it. Slide 40