Performance of Embedded System Application on Network Processor
2006 Spring Directed Study Project
Danhua Guo, University of California, Riverside
dguo@cs.ucr.edu
06-07-2006

Motivation
- NP overview
- Programmability (GPP) vs. performance (ASIC)
- Project motivation: architectural portability
- Performance: ASIC > NP > GPP; programmability: GPP > NP > ASIC

Motivation (Cont.)
- The NP has proved to be a successful design for packet processing.
- The NP architecture (XScale + MEs) is similar to an embedded-processor architecture (ARM + DSP).
- How do embedded applications perform on an NP?

Intel IXA 2400 NP Architecture
(Source: Intel Corporation, Intel IXP2400 Network Processor)

IXA 2400 NP Resources Summary

XScale Core
- 32-bit general-purpose processor
- Runs the application; initializes and manages the chip

MicroEngine (ME) resources:

Name    | Size (Bytes) | Latency (Cycles) | Purpose
--------|--------------|------------------|--------
GPR     | 256          | --               | Incurs no penalty for a context switch
XferR   | 512          | --               | Holds data while asynchronous I/O is pending
NNR     | 128          | --               | Transfers data from the previous/next-neighbor ME
LM      | 640          | 3                | Caches data needed by an ME
Scratch | 16K          | 60               | General-purpose use with atomic operations and ring support
SRAM    | 64M          | 90               | Control-information storage
SDRAM   | 2G           | 120              | Data-buffer storage

Core Programming Models: Context vs. Functional

Context Pipeline
- The context is passed through the pipeline; a single ME is dedicated entirely to a single function.
- +++: Good for programs with a large code size; good for on-chip storage of vectors/tables.
- ---: Bad for large context passing; all pipe stages must execute at the minimum packet arrival rate (hard to partition).

Functional Pipeline
- Functions are passed through the pipeline; a single ME can constitute a functional pipeline.
- +++: Supports a longer execution period than context pipe-stages; execution time for each pipe-stage is flexible.
- ---: An ME must support multiple functions; mutual exclusion may be more difficult.

Which one is better? A combination of the context and functional models.

Environment: Intel IXA SDK Workbench
- A software development kit for NP programming
- Input: MicroC or MicroCode
- Configuration: data flow, ME distribution, etc.
- Output: a .list file for each ME

MiBench Overview
- Automotive, Office, Security, Telecommunication, Consumer

Types of Embedded Application
- Control intensive: branch instructions
- Computation intensive: integer and floating-point ALU operations
- I/O intensive: memory (load and store)

Experiment Outline
- Port and map MiBench to the SDK
- Configure the SDK environment
- Feed the .list files into the SDK
- Analyze and evaluate the results

Port Mapping
Why is it required? The MicroC compiler is not a complete ANSI C implementation. The compiler does not support:
- the full standard C runtime library
- automatic parallelization of code
- C++
- floating-point data types (float and double)
- function pointers and recursion
- functions with a variable number of arguments (varargs)

Port Mapping (Cont.)
How does it work?

Data Allocation
- Register regions: GPR, NN, Xfer
- Memory regions: LM, Scratch, SRAM, DRAM
- Allocation attributes (__declspec) can describe where a variable is allocated.

Register Spillage
- Default spill order: next-neighbor registers -> local memory -> SRAM memory

Functions
- Inline functions (avoid function-calling overhead)
- Optimizing pointer arguments (reduce callee access overhead)

Port Mapping (Cont.)
Optimizing Your Code
- Minimize variable allocation to memory. Taking the address of a variable, or declaring it with a memory-region attribute, causes allocation to memory. Structures/arrays larger than 64 bytes (or 128 bytes in 4-context mode) are allocated to memory.
- Minimize access to memory variables. When possible, declare a variable as a local variable in main() rather than as a global or static variable, so that unnecessary initialization is avoided.
- Efficient structure access. Size structure members to be a multiple of four bytes, to fall on a four-byte offset from the start of the structure, and not to cross a four-byte boundary.

SDK Configuration
- Library directory: which libraries to include
- Memory information: which parts of the memory segments are reserved for variables
- Thread information: which mode to use; number of contexts (threads) to run
- ME information: which .list file is fed into which ME

Environment
- Intel IXA SDK (select IXP2400)
- 600-MHz MEs
- 200-MHz SRAM
- 150-MHz DRAM
- Executed in multiple threads
- Executed on multiple MicroEngines

Experiment Results
[Charts: MicroEngine utilization percentage (Executing / Aborted / Stalled / Idle) in 8-context mode on one ME, for string_search and for Rijndael; and throughput (MIPS) for different numbers of MicroEngines (1, 2, 4, 8), multi-core and multi-thread, for string_search and Rijndael.]

References
- Intel C Compiler User's Guide
- Intel IXA Development Tools User's Guide
- Matthew R. Guthaus et al., "MiBench: A Free, Commercially Representative Embedded Benchmark Suite"
- Zhangxi Tan et al., "Optimization and Benchmark of Cryptographic Algorithms on Network Processors", 2004
- Intel Corporation, "Intel IXP2400 Network Processor - 2nd Generation Intel NPU"
- Yan Luo et al., "NePSim: A Network Processor Simulator with a Power Evaluation Framework", 2004
- Dominic Herity, "Network Processor Programming", http://www.embedded.com/story/oeg2000730s0053

Thanks to
- Prof. Bhuyan for the idea of the simulation
- Prof. Luo for the suggestions on the SDK
- Mr. Piyush Satapathy for the experience on MicroC coding
- Everyone for giving up your valuable afternoon-coffee time

Questions or Comments?