Case study: Performance-efficient Implementation of Robust Header Compression (ROHC) using an Application-Specific Processor

Gert Goossens, Patrick Verbist, Erik Brockmeyer, Luc De Coster (Synopsys)

Agenda
1. Robust Header Compression (ROHC) in network processing
2. Application-Specific Processor (ASIP) methodology
3. Accelerating control processing in ROHC
4. Accelerating data processing in ROHC
5. Conclusions

ROHC in Network Processing

High-performance streaming data (IP/UDP/RTP protocol)
- Uncompressed packet: IP header (20-40 bytes), UDP header (8 bytes), RTP header (12 bytes), payload (video/audio)
- Compressed packet: ROHC header, payload (video/audio)
- Link: ROHC compressor -> radio or cable link -> ROHC decompressor

ROHC compressor blocks: Feedback Buffer, Header Parser, Context Processor, Header Field Encoder, CRC, Context Memory, Packet Modification Buffer

Performance target
- 1.2 Mpackets/s at a 600 MHz clock leaves 500 cycles/packet
- Header Parser: ~100 cycles/packet
- Encoder + Context + CRC: ~400 cycles/packet
- Optimize for the worst-case control path
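The cycle budget on this slide follows directly from the clock rate and the packet rate. A minimal sketch of that arithmetic is shown below; the function names are illustrative and not taken from the presentation, and the model assumes each packet must be finished before the next one arrives.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per packet, assuming one packet is fully processed
 * before the next arrives (no buffering slack is modelled). */
uint32_t cycles_per_packet(uint64_t clock_hz, uint64_t packets_per_s)
{
    assert(packets_per_s > 0);
    return (uint32_t)(clock_hz / packets_per_s);
}

/* Budget left for encoder, context processing and CRC once the
 * header parser has taken its share of the per-packet budget. */
uint32_t remaining_budget(uint32_t total_cycles, uint32_t parser_cycles)
{
    assert(parser_cycles <= total_cycles);
    return total_cycles - parser_cycles;
}
```

With the slide's figures, `cycles_per_packet(600000000, 1200000)` gives 500, and `remaining_budget(500, 100)` gives the 400 cycles quoted for encoder, context, and CRC.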

ROHC Implementation

[Figure: ROHC compressor block diagram with Feedback Buffer, Header Parser, Context Processor, Header Field Encoder, CRC, Context Memory, and Packet Modification Buffer]

- Blocks requiring efficient control flow: tiny microprocessor with efficient branching and logic operations
- Blocks requiring efficient control flow and data processing: tiny microprocessor with hardware-accelerated instructions
- ASIP technology enables the design of such processors

ASIPs in SoC Design

ASIP architectural optimization space

Parallelism
- Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set
- Data-level parallelism: vector processing (SIMD)
- Task-level parallelism: multicore, multithreading

Specialization
- Application-specific data types: integer, fractional, floating-point, bits, complex, vector
- Application-specific instructions
  - App.-specific memory addressing: direct, indirect, post-modification, indexed, stack indirect
  - App.-specific data processing: any exotic operator, single or multi-cycle
  - App.-specific control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication; relative or absolute, address range, delay slots
- Connectivity and storage matching the application's data flow: distributed registers, sub-ranges; multiple memories, sub-ranges
- Pipeline: pipeline depth; hazards (HW/SW stall, bypass)

Design spectrum: Microprocessor, Extensible Processor, Application-Specific µP / DSP, Programmable Datapath, Hardwired Datapath

ASIP Designer Tool-Suite

[Figure: ASIP Designer tool-suite overview]

Accelerated Control Processing
Customization of a 16-bit CPU: Strip Down & Beef Up

Architectural exploration with ASIP Designer
- Starting point: Tmicro CPU
  - 16-bit general-purpose CPU (already leaner than a 32-bit CPU)
  - Variable-length instructions: arithmetic (16), move (16, 32), load/store (16, 32), control (16, 32, 48)
- End point: Tnano ASIP
  - 16-bit stripped CPU
  - Fixed-length instructions: arithmetic, move, load/store, control (16)
  - No multi-word decoding overhead
  - Improved clock frequency
- Add compact control instructions to accelerate ROHC code
  - Predicated execution (selection)
  - Field extraction (masking)
  - Shortcut logic instructions
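Field extraction is the kind of operation a header parser performs constantly. As an illustration only (the presentation does not show the Tnano instruction's actual semantics), the portable C below extracts a bit field from a 16-bit word with a shift and a mask. A dedicated field-extraction instruction would presumably collapse these steps into one operation. The name `extract_field` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits starting at bit `lsb` (0 = least significant)
 * from a 16-bit word, using a shift followed by a mask. */
uint16_t extract_field(uint16_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 16);
    uint32_t mask = ((uint32_t)1 << width) - 1u;  /* width low bits set */
    return (uint16_t)((word >> lsb) & mask);
}
```

For example, the first 16-bit word of a typical IPv4 header is 0x4500: `extract_field(0x4500, 12, 4)` yields the version (4) and `extract_field(0x4500, 8, 4)` yields the header length in 32-bit words (5).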

Accelerated Control Processing
Control Path Balancing

Example: control-flow graph of the Header Parser
[Figure: control-flow graph marking the longest and the shortest control path]

- Improve control path balancing by refactoring the C source code
- User control over code hoisting
- Predicated execution in the tail of long control paths

Accelerated Control Processing
If-Else, No Predication: Tmicro (general-purpose CPU)

- C: condition at the tail of a long control path
- nML: conditional jump instruction, 2-cycle branch penalty
- Machine code: conditional jump with branch penalty; one of two delay slots filled, one nop left

Accelerated Control Processing
Predication: Tnano (optimized ASIP)

- C: condition at the tail of a long control path
- nML: select instruction
- Machine code: conditional code always executes; the result is used selectively; no branch penalty
- nML: predication threshold
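The difference between the two preceding slides can be sketched in plain C. The code from the presentation is not reproduced in this transcript, so the functions below are hypothetical stand-ins: the first uses an if-else that a compiler would typically turn into a conditional jump, and the second computes both candidate results unconditionally and then selects one, which is the behaviour the select instruction provides without a branch penalty.

```c
#include <stdint.h>

/* Branching form: typically compiled to a compare plus conditional jump. */
uint16_t pick_branch(uint16_t value, uint16_t limit)
{
    uint16_t out;
    if (value > limit) {
        out = limit;
    } else {
        out = (uint16_t)(value + 2u);
    }
    return out;
}

/* Select form: both results are always computed, then one is chosen.
 * The mask is 0xFFFF when the condition holds and 0x0000 otherwise. */
uint16_t pick_select(uint16_t value, uint16_t limit)
{
    uint16_t if_true  = limit;
    uint16_t if_false = (uint16_t)(value + 2u);
    uint16_t mask = (uint16_t)(0u - (unsigned)(value > limit));
    return (uint16_t)((if_true & mask) | (if_false & (uint16_t)~mask));
}
```

Both functions return the same value for every input. The select form does a fixed amount of work regardless of the condition, which is why it suits the tail of a long control path where the worst case matters.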

Accelerated Control Processing
If-Else with Multiple Tests: Tmicro (general-purpose CPU)

- C: if-else with multiple tests
- nML: stand-alone compare instruction
- Machine code: multiple compare and conditional-jump instructions; slow in the worst case

Accelerated Control Processing
If-Else with Multiple Tests: Tnano (optimized ASIP)

- C: if-else with multiple tests
- nML: compare + shortcut-logic instruction (CND &= Rj==Ri, CND = Rj!=Ri)
- Machine code: multiple compare + shortcut-logic instructions, single conditional jump
- The worst case is always faster
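The effect of a compare + shortcut-logic instruction can be mimicked in C by folding each comparison into one condition variable and testing it once. This is a sketch under assumptions: the four-field comparison and the function names are invented for illustration, and only the `CND &= Rj==Ri` pattern comes from the slide.

```c
#include <stdint.h>

/* Branching form: each test may cost its own compare and conditional jump. */
int fields_match_branchy(const uint16_t cur[4], const uint16_t ctx[4])
{
    if (cur[0] != ctx[0]) return 0;
    if (cur[1] != ctx[1]) return 0;
    if (cur[2] != ctx[2]) return 0;
    if (cur[3] != ctx[3]) return 0;
    return 1;
}

/* Accumulating form: mirrors "CND &= Rj==Ri". Every comparison is
 * folded into one condition, which is then tested a single time. */
int fields_match_accumulated(const uint16_t cur[4], const uint16_t ctx[4])
{
    int cnd = 1;
    cnd &= (cur[0] == ctx[0]);
    cnd &= (cur[1] == ctx[1]);
    cnd &= (cur[2] == ctx[2]);
    cnd &= (cur[3] == ctx[3]);
    return cnd;  /* a single conditional jump would consume this */
}
```

The accumulating form always evaluates all four comparisons, so it gives up the early exit of the branching form in exchange for a single jump and a fixed cost on the worst-case path.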

Accelerated Control Processing
Results: Header Parser

                                     Tmicro CPU      Tnano ASIP
Rohc_parse program code size         347 x 16-bit    227 x 16-bit (-35%)
Rohc_parse cycle count per packet    191             87 (-55%)
Clock frequency (28nm HPM)           800 MHz         1 GHz (+25%)
Gate count (core only, 28nm HPM)     14K gates       5.4K gates (-61%)
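The cycle-count and clock-frequency rows combine into a per-packet parsing time. The small helper below (a hypothetical name, not from the presentation) makes that arithmetic explicit by multiplying the cycle count by the clock period.

```c
#include <assert.h>
#include <stdint.h>

/* Time to process one packet, in picoseconds:
 * cycles * (1e6 / clock_mhz) ps per cycle. */
uint64_t packet_time_ps(uint32_t cycles, uint32_t clock_mhz)
{
    assert(clock_mhz > 0);
    return (uint64_t)cycles * 1000000u / clock_mhz;
}
```

With the table's values, `packet_time_ps(191, 800)` is 238750 ps for Tmicro and `packet_time_ps(87, 1000)` is 87000 ps for Tnano, so the header parser finishes a packet roughly 2.7 times sooner once both improvements are taken into account.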

Accelerated Data Processing

[Figure: ROHC compressor block diagram with Feedback Buffer, Header Parser, Context Processor, Header Field Encoder, CRC, Context Memory, and Packet Modification Buffer]

Data-processing functions in the compressor: CRC; scaled / timer-based RTP timestamp compression; WLSB encoder

Implementation styles
- Software on processor: too slow?
- Hardware co-processors: (manual) design effort, synchronization challenge?
- Hardware-accelerated instructions in the ASIP instruction set: well supported by tools, potential for resource sharing!

Accelerated Data Processing
WLSB Encoder: SW Implementation, Tmicro (general-purpose CPU)

- nML: general-purpose ALU: add, sub, shift, mask
- C: software implementation of the WLSB encoder: for-loop with a called function
- Machine code: 30 instructions for the called function
- 6-packet test program: 2110 cycles
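The slide's C code is not part of this transcript. As a rough sketch of what a "for-loop with a called function" WLSB encoder can look like, the code below follows the interval definition of RFC 3095, f(v_ref, k) = [v_ref - p, v_ref + (2^k - 1) - p], for 16-bit fields. All names are hypothetical, and this is not the authors' implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest k such that v lies in [v_ref - p, v_ref + (2^k - 1) - p],
 * evaluated modulo 2^16. Uses only add, sub, shift and compare. */
unsigned wlsb_bits_for_ref(uint16_t v, uint16_t v_ref, int16_t p)
{
    /* Distance of v from the lower interval bound, modulo 2^16. */
    uint16_t offset = (uint16_t)(v - v_ref + p);
    unsigned k = 0;
    while (k < 16 && offset > (uint16_t)((1u << k) - 1u)) {
        k++;
    }
    return k;
}

/* Window step: k must be large enough for every reference value
 * in the sliding window, so take the maximum over the window. */
unsigned wlsb_encode_bits(uint16_t v, const uint16_t *window, size_t n,
                          int16_t p)
{
    unsigned k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned ki = wlsb_bits_for_ref(v, window[i], p);
        if (ki > k) {
            k = ki;
        }
    }
    return k;
}

/* The k least significant bits of v that would be transmitted. */
uint16_t wlsb_lsbs(uint16_t v, unsigned k)
{
    return k >= 16 ? v : (uint16_t)(v & ((1u << k) - 1u));
}
```

For a window of {100, 101, 102}, a new value of 103, and p = 0, the function returns k = 2 and the transmitted bits are 3. The inner loop in `wlsb_bits_for_ref` is the kind of per-reference work that a single hardware-accelerated instruction could replace.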

Accelerated Data Processing
WLSB Encoder: HW-Accelerated Instruction, Tnano (optimized ASIP)

- C: intrinsic function call to the WLSB encoder instruction
- nML (behavioral view): WLSB hardware primitive in bit-accurate C code, auto-translated to RTL
- nML (ISA view): WLSB encoder instruction, calling the hardware primitive
- Machine code: called function replaced by a single instruction
- 6-packet test program: 267 cycles (7.9x speedup)

Accelerated Data Processing
Results: Adding HW-Accelerated Instructions

                                          Tmicro CPU      Tnano ASIP      Tnano ASIP w/ WLSB instr
WLSB 6-packet test program code size      134 x 16-bit    126 x 16-bit    84 x 16-bit (-33%)
WLSB 6-packet test program cycle count    2122            2110            267 (-87%)
Clock frequency (28nm HPM)                800 MHz         1 GHz           1 GHz (0%)
Gate count (core only, 28nm HPM)          14K gates       5.4K gates      6.3K gates (+16%)

Conclusions

- Application-Specific Processors (ASIPs)
  - Enable acceleration of control and data processing, similar to fixed-function hardware
  - Retain the flexibility of a software-programmable processor
- ASIP Designer allows ASIPs to be designed quickly
  - Architectural exploration: Compiler-in-the-Loop
  - SDK generation
  - RTL generation
- Benefits illustrated with the Robust Header Compression (ROHC) case study