Case study: Performance-efficient Implementation of Robust Header Compression (ROHC) using an Application-Specific Processor

Gert Goossens, Patrick Verbist, Erik Brockmeyer, Luc De Coster
Synopsys

Agenda
1. Robust Header Compression (ROHC) in network processing
2. Application-Specific Processor (ASIP) methodology
3. Accelerating control processing in ROHC
4. Accelerating data processing in ROHC
5. Conclusions

ROHC in Network Processing

High-performance streaming data (IP/UDP/RTP protocol)
- Uncompressed packet: IP header (20-40 bytes), UDP header (8 bytes), RTP header (12 bytes), payload (video/audio)
- Compressed packet: ROHC header, payload (video/audio)
- Data path: ROHC compressor, radio or cable link, ROHC decompressor

ROHC compressor blocks (block diagram): Feedback Buffer, Header Parser, Context Processor, Header Field Encoder, CRC, Context Memory, Packet Modification Buffer

Performance budget
- 1.2 Mpackets/s at a 600 MHz clock: 500 cycles/packet
- Header Parser: ~100 cycles/packet
- Encoder + Context + CRC: ~400 cycles/packet
- Optimize for the worst-case control path
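The cycle budget on this slide follows from dividing the clock frequency by the packet rate. The sketch below reproduces that arithmetic; the function name `cycles_per_packet` is an illustrative assumption and does not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle budget per packet = clock frequency / packet rate.
   The function name is hypothetical, chosen for this sketch. */
uint32_t cycles_per_packet(uint64_t clock_hz, uint64_t packets_per_second)
{
    assert(packets_per_second > 0);
    return (uint32_t)(clock_hz / packets_per_second);
}
```

With the slide's figures, `cycles_per_packet(600000000, 1200000)` gives 500, which the slide splits into roughly 100 cycles for the Header Parser and 400 for Encoder + Context + CRC.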

ROHC Implementation

Blocks (block diagram): Feedback Buffer, Context Processor, CRC, Context Memory, Header Parser, Header Field Encoder, Packet Modification Buffer

- Blocks requiring efficient control flow: tiny microprocessor with efficient branching and logic operations
- Blocks requiring efficient control flow and data processing: tiny microprocessor with hardware-accelerated instructions

ASIP technology enables the design of such processors.

ASIPs in SoC Design

ASIP architectural optimization space

Parallelism
- Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set
- Data-level parallelism: vector processing (SIMD)
- Task-level parallelism: multicore, multithreading

Specialization
- Application-specific data types: integer, fractional, floating-point, bits, complex, vector
- Application-specific instructions
  - Memory addressing: direct, indirect, post-modification, indexed, stack indirect
  - Data processing: any exotic operator, single or multi-cycle
  - Control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication; relative or absolute, address range, delay slots
- Connectivity and storage matching the application's data flow: distributed registers, sub-ranges; multiple memories, sub-ranges
- Pipeline: pipeline depth; hazards (HW/SW stall, bypass)

Implementation spectrum: Microprocessor, Extensible Processor, Application-Specific µP / DSP, Programmable Datapath, Hardwired Datapath

ASIP Designer Tool-Suite

Accelerated Control Processing
Customization of a 16-bit CPU: Strip Down & Beef Up

Architectural exploration with ASIP Designer

Starting point: Tmicro CPU
- 16-bit general-purpose CPU (already leaner than 32-bit)
- Variable-length instructions: arithmetic (16), move (16, 32), load/store (16, 32), control (16, 32, 48)

End point: Tnano ASIP
- 16-bit stripped CPU
- Fixed-length instructions: arithmetic, move, load/store, control (16)
  - No multi-word decoding overhead
  - Improved clock frequency
- Added compact control instructions to accelerate ROHC code
  - Predicated execution (selection)
  - Field extraction (masking)
  - Shortcut logic instructions
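To illustrate what field extraction (masking) does in a header parser, here is a minimal C sketch of the shift-and-mask sequence that a single extraction instruction could replace. The function `extract_field` and its parameters are assumptions made for this example; the slides do not show the actual Tnano instruction or its operands.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits starting at bit position `lsb` of a 16-bit word.
   Hypothetical helper: on a general-purpose CPU this costs a shift and
   a mask; a field-extraction instruction would do it in one step. */
uint16_t extract_field(uint16_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 16);
    uint32_t mask = ((uint32_t)1 << width) - 1u;
    return (uint16_t)(((uint32_t)word >> lsb) & mask);
}
```

For example, the first 16 bits of a typical IPv4 header are `0x4500`: `extract_field(0x4500, 12, 4)` yields the version (4) and `extract_field(0x4500, 8, 4)` yields the header length in 32-bit words (5).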

Accelerated Control Processing
Control Path Balancing

Example: control-flow graph of the Header Parser (the figure marks the longest and the shortest control path)

Improve control path balancing by:
- Refactoring the C source code
- User control over code hoisting
- Predicated execution in the tail of long control paths

Accelerated Control Processing
If-Else, No Predication: Tmicro (general-purpose CPU)

- C: condition at the tail of a long control path
- nML: conditional jump instruction, 2-cycle branch penalty
- Machine code: conditional jump with branch penalty; one of the two delay slots is filled, one nop is left

Accelerated Control Processing
Predication: Tnano (optimized ASIP)

- C: condition at the tail of a long control path
- nML: select instruction
- Machine code: the conditional code always executes, the result is used selectively, and there is no branch penalty
- nML: predication threshold
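The difference between the two if-else slides can be sketched in C. The first function is the branching form; the second computes a mask and selects the result without a jump, which is the effect a select instruction provides. Both function names and the 16-bit update scenario are assumptions for illustration; the slides' actual C, nML, and machine code are not reproduced here.

```c
#include <stdint.h>

/* Branching form: on a CPU without a select instruction, a compiler
   would typically emit a conditional jump for this if statement. */
uint16_t update_branch(uint16_t old_value, uint16_t new_value, int cond)
{
    if (cond) {
        return new_value;
    }
    return old_value;
}

/* Predicated form: the new value is always available and a mask
   decides whether it is used, so no jump is required. */
uint16_t update_select(uint16_t old_value, uint16_t new_value, int cond)
{
    /* mask is 0xFFFF when cond is non-zero, 0x0000 otherwise */
    uint16_t mask = (uint16_t)(0u - (unsigned)(cond != 0));
    return (uint16_t)((new_value & mask) | (old_value & (uint16_t)~mask));
}
```

Both functions return the same value for every input. The predicated form trades a possible branch penalty for always executing the conditional code, which is why the slide places it in the tail of long control paths.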

Accelerated Control Processing
If-Else with Multiple Tests: Tmicro (general-purpose CPU)

- C: if-else with multiple tests
- nML: stand-alone compare instruction
- Machine code: multiple compare and conditional-jump instructions; slow in the worst case

Accelerated Control Processing
If-Else with Multiple Tests: Tnano (optimized ASIP)

- C: if-else with multiple tests
- nML: compare + shortcut-logic instruction
  - CND &= Rj==Ri
  - CND = Rj!=Ri
- Machine code: multiple compare + shortcut-logic instructions, followed by a single conditional jump
- The worst case is always faster!
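The effect of the compare + shortcut-logic instructions can be approximated in C by accumulating all tests into one condition flag and branching once. The sketch below contrasts that with one jump per test. The function names and the three-field comparison are hypothetical; only the `CND = ...` and `CND &= ...` forms come from the slide.

```c
#include <stdint.h>

/* One compare and one conditional jump per test, as on the
   general-purpose CPU. */
int fields_match_jumps(const uint16_t *pkt, const uint16_t *ctx)
{
    if (pkt[0] != ctx[0]) return 0;
    if (pkt[1] != ctx[1]) return 0;
    if (pkt[2] == ctx[2]) return 0;
    return 1;
}

/* Tests accumulated into a single condition flag; the caller needs
   only one conditional jump on the result. */
int fields_match_accumulated(const uint16_t *pkt, const uint16_t *ctx)
{
    int cnd = (pkt[2] != ctx[2]);   /* CND  = Rj != Ri */
    cnd &= (pkt[0] == ctx[0]);      /* CND &= Rj == Ri */
    cnd &= (pkt[1] == ctx[1]);      /* CND &= Rj == Ri */
    return cnd;
}
```

The accumulated form always evaluates every test, so its cycle count does not depend on which test fails. That fits the slide's goal of optimizing the worst-case path.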

Accelerated Control Processing
Results: Header Parser

Metric                              Tmicro CPU      Tnano ASIP
Rohc_parse program code size        347 x 16-bit    227 x 16-bit (-35%)
Rohc_parse cycle count per packet   191             87 (-55%)
Clock frequency (28nm HPM)          800 MHz         1 GHz (+25%)
Gate count (core only, 28nm HPM)    14K gates       5.4K gates (-61%)

Accelerated Data Processing

ROHC compressor blocks (block diagram): Feedback Buffer, Header Parser, Context Processor, Header Field Encoder, CRC, Context Memory, Packet Modification Buffer

Data-processing functions
- CRC
- Scaled / Timer-Based RTP Timestamp Compression
- WLSB encoder

Implementation styles
- Software on processor: too slow?
- Hardware co-processors: (manual) design effort, synchronization challenge?
- Hardware-accelerated instructions in the ASIP instruction set: well supported by tools, potential for resource sharing!

Accelerated Data Processing
WLSB Encoder: SW Implementation, Tmicro (general-purpose CPU)

- nML: general-purpose ALU (add, sub, shift, mask)
- C: software implementation of the WLSB encoder, a for-loop with a called function
- Machine code: 30 instructions for the called function
- 6-packet test program: 2110 cycles
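The slide's C code is not part of this text, so the following is only a sketch of what a for-loop with a called function could look like for WLSB (window-based least-significant-bits) encoding, following the general definition used for ROHC in RFC 3095: for each reference value in a sliding window, find the smallest number of low-order bits k that still identifies the new value, then take the maximum over the window. The function names, the 16-bit field width, and the offset parameter `p` handling are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Called function: smallest k such that v lies in the interval
   [v_ref - p, v_ref - p + 2^k - 1], computed modulo 2^16. */
unsigned lsb_bits_needed(uint16_t v, uint16_t v_ref, uint16_t p)
{
    uint16_t offset = (uint16_t)(v - v_ref + p);  /* wraps modulo 2^16 */
    unsigned k = 0;
    while (k < 16 && offset >= (1u << k)) {
        k++;
    }
    return k;
}

/* For-loop over the window of reference values: the encoder must use
   enough bits to be decodable against every reference in the window. */
unsigned wlsb_bits_needed(uint16_t v, const uint16_t *window, size_t n,
                          uint16_t p)
{
    assert(window != NULL && n > 0);
    unsigned k_max = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned k = lsb_bits_needed(v, window[i], p);
        if (k > k_max) {
            k_max = k;
        }
    }
    return k_max;
}
```

With a window of {100, 101, 102}, a new value of 103, and `p = 0`, the result is 2 bits. The inner search uses only subtract, shift, and compare operations, which matches the slide's point that a general-purpose ALU handles this with many simple instructions.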

Accelerated Data Processing
WLSB Encoder: HW-Accelerated Instruction, Tnano (optimized ASIP)

- C: intrinsic function call to the WLSB encoder instruction
- nML (behavioral view): WLSB hardware primitive in bit-accurate C code, auto-translated to RTL
- nML (ISA view): WLSB encoder instruction, calling the hardware primitive
- Machine code: the called function is replaced by a single instruction
- 6-packet test program: 267 cycles (7.9x speedup)

Accelerated Data Processing
Results: Adding HW-Accelerated Instructions

Metric                                    Tmicro CPU      Tnano ASIP      Tnano ASIP w/ WLSB instr
WLSB 6-packet test program code size      134 x 16-bit    126 x 16-bit    84 x 16-bit (-33%)
WLSB 6-packet test program cycle count    2122            2110            267 (-87%)
Clock frequency (28nm HPM)                800 MHz         1 GHz           1 GHz (0%)
Gate count (core only, 28nm HPM)          14K gates       5.4K gates      6.3K gates (+16%)

Conclusions

Application-Specific Processors (ASIPs)
- Enable acceleration of control and data processing, similar to fixed-function hardware
- Retain the flexibility of a software-programmable processor

ASIP Designer enables quick ASIP design
- Architectural exploration: Compiler-in-the-Loop
- SDK generation
- RTL generation

Benefits illustrated with the Robust Header Compression (ROHC) case study