Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc.

Similar documents
The S6000 Family of Processors

Ten Reasons to Optimize a Processor

Configurable and Extensible Processors Change System Design. Ricardo E. Gonzalez Tensilica, Inc.

Flash Memory Summit 2011

Introduction to Embedded Systems

VLSI Design Automation

VLSI Design Automation. Calcolatori Elettronici Ing. Informatica

IP CORE Design 矽智產設計. C. W. Jen 任建葳.

ARM Processors for Embedded Applications

Xtensa 7 Configurable Processor Core

VLSI Design Automation

Test and Verification Solutions. ARM Based SOC Design and Verification

Microprocessors, Lecture 1: Introduction to Microprocessors

The Nios II Family of Configurable Soft-core Processors

Fundamentals of Computer Design

ConnX D2 DSP Engine. A Flexible 2-MAC DSP. Dual-MAC, 16-bit Fixed-Point Communications DSP PRODUCT BRIEF FEATURES BENEFITS. ConnX D2 DSP Engine

Embedded Systems. 7. System Components

Moore s Law: Alive and Well. Mark Bohr Intel Senior Fellow

Xtensa. Andrew Mihal 290A Fall 2002

Qsys and IP Core Integration

Embedded System Design

Technology Roadmap 2002

ECE332, Week 2, Lecture 3. September 5, 2007

ECE332, Week 2, Lecture 3

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

VLSI Design Automation. Maurizio Palesi

ECE 111 ECE 111. Advanced Digital Design. Advanced Digital Design Winter, Sujit Dey. Sujit Dey. ECE Department UC San Diego

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs

Intelop. *As new IP blocks become available, please contact the factory for the latest updated info.

Computers: Inside and Out

L2: FPGA HARDWARE : ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA

Copyright 2016 Xilinx

Software Driven Verification at SoC Level. Perspec System Verifier Overview

Hardware Design with VHDL PLDs IV ECE 443

Overview of Microcontroller and Embedded Systems

FPGA & Hybrid Systems in the Enterprise Drivers, Exemplars and Challenges

Embedded System Design

Veloce2 the Enterprise Verification Platform. Simon Chen Emulation Business Development Director Mentor Graphics

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

Platform-based Design

ZeBu : A Unified Verification Approach for Hardware Designers and Embedded Software Developers

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

Early Software Development Through Emulation for a Complex SoC

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany

ECE 747 Digital Signal Processing Architecture. DSP Implementation Architectures

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

An Introduction to Programmable Logic

Putting MPSOC to Work in Multimedia

MPSoC Design Space Exploration Framework

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

S2C K7 Prodigy Logic Module Series

Outline Marquette University

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

More Course Information

Advanced Digital Design Using FPGA. Dr. Shahrokh Abadi

COMPUTER ORGANIZATION (CSE 2021)

Rapidly Developing Embedded Systems Using Configurable Processors

Building blocks for 64-bit Systems Development of System IP in ARM

ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

The Veloce Emulator and its Use for Verification and System Integration of Complex Multi-node SOC Computing System

LX4180. LMI: Local Memory Interface CI: Coprocessor Interface CEI: Custom Engine Interface LBC: Lexra Bus Controller

AT-501 Cortex-A5 System On Module Product Brief

High-Performance, Highly Secure Networking for Industrial and IoT Applications

SoC Platforms and CPU Cores

Improving Application Performance with Instruction Set Architecture Extensions to Embedded Processors

FPQ6 - MPC8313E implementation

1. Data plane blocks can be optimized for different applications. 2. The IP blocks can be reused and the design complexity decreases.

Memory Systems IRAM. Principle of IRAM

Enabling the design of multicore SoCs with ARM cores and programmable accelerators

DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER

ALTERA FPGAs Architecture & Design

Logic and Physical Synthesis Methodology for High Performance VLIW/SIMD DSP Core

Embedded Systems: Architecture

Multimedia in Mobile Phones. Architectures and Trends Lund

Hardware/Software Co-Verification Using the SystemVerilog DPI

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today.

System-on-Chip. Outline. Example: iphone 3GS disassembled. System-on-Chip is Everywhere! SoC Challenges. SoC Challenges and Current Solutions

November 11, 2009 Chang Kim ( 김창식 )

COMPUTER ORGANIZATION (CSE 2021)

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

CHAPTER 1 Introduction

New System Solutions for Laser Printer Applications by Oreste Emanuele Zagano STMicroelectronics

From Concept to Silicon

CHAPTER 1 Introduction

Building supercomputers from embedded technologies

04 - DSP Architecture and Microarchitecture

Learning Outcomes. Spiral 3 1. Digital Design Targets ASICS & FPGAS REVIEW. Hardware/Software Interfacing

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

Resource Efficiency of Scalable Processor Architectures for SDR-based Applications

Multi-Core Microprocessor Chips: Motivation & Challenges

Introduction 1. GENERAL TRENDS. 1. The technology scale down DEEP SUBMICRON CMOS DESIGN

Computer Architecture

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

FPGA Based Digital Design Using Verilog HDL

ECE 486/586. Computer Architecture. Lecture # 2

Platform for System LSI Development

Network Processors. Douglas Comer. Computer Science Department Purdue University 250 N. University Street West Lafayette, IN

Chapter 0 Introduction

Using FPGAs in Supercomputing Reconfigurable Supercomputing

Transcription:

Configurable s for SOC Design Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc.

Why Listen to This Presentation? Understand how SOC design techniques, now nearly 20 years old, are breaking down under the continued application of Moore s Law Learn about the significant differences between fixed-isa microprocessor cores for SOC design and configurable cores Unlearn some outdated design habits

Today s Standard SOC Design Techniques are Failing HDLs and logic synthesis developed at a time when ASICs were ~200,000 gates 90nm silicon puts 200K gates in 1 mm 2, easily accommodates 500Kgate blocks, and 100M max gate counts per chip Hand-coded HDL productivity has not and cannot keep pace Verification now >70% of the total SOC design effort Not enough verification engineers in the world The only solution is massive adoption of pre-verified IP blocks Cost of fixing SOC bugs is rising A silicon respin costs more than $1M US (1.5M Euro) Late hardware/software integration Has not changed in 35 years SOC designs much more complex Example: Multiple audio, video compression standards

Increasing SOC Design Complexity Spurs Thirst for Processing Speed New multimedia standards set the pace for a rapid rise in computational power The tried-and-true way to get more processor performance has been clock rate increases achieved through Denard (Classical) Scaling Often misidentified as Moore s Law Moore s Law (2x more transistors every ~18 months) Lives! Denard Scaling died with the 90nm node Transistors no longer getting faster as quickly Leakage effects from low-vt transistors make low-power operation tough

Data Speed Increases Every Year. Why?

Rise in Packaged Microprocessor Clock Rate Over Time

Clock Frequency Inflection Point End of the Road for Clock-Rate Ramp Performance (2003 = 1.0) 10 Transistor-constrained clock (21%) Power-constrained clock (4%) 1 2003 2005 2007 2009 2011 2013 Year Based on International Technology Roadmap for Semiconductors 12/03

What Replaces Clock-Rate Increases? configuration New registers and register files sized to the data 24 bits for audio 56 bits for encryption New function units to match application 24-bit multipliers for audio Time-to-live field extraction/increment for networking Viterbi butterfly for mobile phones SAD (sum of absolute differences) pixel processing for video

ByteSwap: A Simple Example

ByteSwap in C and ASM unsigned ss = (s<<24) ((s<<8)&0xff0000) ((s>>8)&0xff00) (s>>24); slli a9, a14, 24 slli a8, a14, 8 srli a10, a14, 8 and a10, a10, a11 and a8, a8, a13 or a8, a8, a9 extui a9, a14, 24, 8 or a10, a10, a9 or a10, a10, a8 Form intermediate result bits 24-31 in register a9 Shift 32-bit word left by 8 bits, save in register a8 Shift 32-bit word right by 8 bits, save in register a10 Form intermediate result bits 9-15 in register a10 Form intermediate result bits 16-23 in register a8 Form intermediate result bits 16-31, save in register a8 Extract result bits 0-7, save in register a9 Form result bits 0-15, save in register a10 Form final result in register a10

Define ByteSwap Instruction in TIE, Use BYTESWAP as C Intrinsic One instruction versus nine operation BYTESWAP {out AR outr, in AR inpr} { } { wire [31:0] reg_swapped = {inpr[7:0],inpr[15:8],inpr[23:16],inpr[31:24]}; assign outr = reg_swapped; }

Define ByteSwap Instruction in TIE, Use BYTESWAP as C Intrinsic One instruction versus nine operation BYTESWAP {out AR outr, in AR inpr} { } { wire [31:0] reg_swapped = {inpr[7:0],inpr[15:8],inpr[23:16],inpr[31:24]}; assign outr = reg_swapped; } Writing this one line of TIE code and submitting it to the processor generator Automatically adds the appropiate hardware to the pipeline and instruction decoder Automatically adds the instruction to the compiler, assembler, debugger, linker, and ISS

Viterbi Butterfly Function Unit Path Metrics (256 bits of state, loaded in two instructions) > > > > > > > > State Metric (8x16) Vector Unaugmented processor needs 42 cycles/butterfly Viterbi function unit ~11K gates Augmented processor needs 0.16 cycles/butterfly 250x speedup 128-Bit Data Bus Address Generator 16 Next Address

SIMD SAD Function Unit for Video Unaugmented processor needs 641M cycles to compute SAD for QCIF image @ 15fps Augmented processor computes SAD for 16 pixels/cycle and needs 14M cycles to compute SAD for QCIF image @ 15fps 46x speedup

I/O Just as Important as Computation: Think Outside the Bus Instruction 1 Data Data 2 Instruction Bus IF Bus IF Bus I/O Device Local Shared I/O Device

Where did the Bus Originate? World s First Commercial Microprocessor The Intel 4004, Circa 1971 16 pins necessitating a multiplexed bus And processors have been pin-limited ever since

Rise in Packaged Microprocessor Pin Count Over Time

Wide Interconnect and Wire Density: What s Practical? Do the Math: 8-10 Metal Layers, ITRS 2006 wire spacing 1mm 90nm: 100,000 wires/square mm 1mm Layer 1 Layer 2 65nm: Almost 200,000 wires/square mm Layer 3 Layer 5 Layer 4 Layer 6 Layer 7 Layer 8

Keys to Efficient MP Flexible range of system-interconnect topologies Shared Bus On-chip Routing Network bus Input Device Slave Output Device Slave Slave Slave Routing Node Routing Node Routing Node Routing Node Routing Node Routing Node Routing Node Routing Node Routing Node Global Slave Global I/O Slave Global I/O Slave Cross-Bar Input Device Slave Output Device Slave Slave Slave Application-specific Output Device Slave Dual-Port Queue bus Queue Input Device Slave Global Slave Queue

Getting Your Data Off of the Bus: Dual-Ported Instruction 1 Data Local IF Local Shared Dual Port Data Local IF 2 Instruction

Instantaneous Data Teleportation: Direct-Port Connection Data 1 Instruction Data 2 Instruction Extension Data Register Data Register Extension

Fast, Wide Data Transfers: FIFO-Queue Connection Data 1 Instruction Data 2 Instruction Extension Full Pop Extension Push Empty

Multiple Queue Destinations Producer Address Decode Consumer 1 Consumer 2 Consumer 3 Consumer 4

True Many-Core System-on-Chip 192 Xtensas Per Chip in Cisco CRS-1 Terabit Router Complete 32-bit processor: 25K gates 12 processors per cluster 16 processor clusters per chip 192 Xtensa network-processor cores per Silicon Packet Up to 400,000 processors per system

Complex Consumer Products EPSON s REALOID-based Printers EPSON s printer controller chip with six heterogeneous, asymmetric, configurable VLIW cores legacy controller I/O EPSON PM-D870 PLL DDR I/F Xtensa LX (#1) DLL DLL PLL Xtensa LX (#2) L2 cache DLL DDR I/F USB Device Xtensa LX (#3) Debug MEMC USB Host/ Device Xtensa LX (#5) Xtensa LX (#6) ADC Xtensa LX (#4) ENG BG Epson REALOID IC Block Layout Controller SCN

Data-plane is More than DSP A Recent Example: Server Engines BladeEngine From EETimes: July 31, 2007 Next generation high-volume server chipset -based acceleration of TCP/IP and network/storage protocols 8 configurable data processors ARM management processor PCI-E x8 PCI-E Host Interface 32 Protected Domains Upper-Level Protocol Engine Upper-Level Protocol Engine Upper-Level Protocol Engine On-Chip RAM IPSEC Engine Non-Offloaded Traffic Public Key Accelerator Upper-Level Protocol Engine Upper-Level Protocol Engine Upper-Level Protocol Engine TCP Tx/Rx 10-GbE MAC 10-GbE MAC 1-GbE MAC 1-GbE MAC UART JTAG Management 10/100 Management Port I2C Debug DDR2 DRAM Interface Flash EEPROM Interface Serial EEPROM Interface

Closing Thoughts 1. SOC complexity will continue to increase 2. Programmability essential to flexible system design 3. Many board-level system-design techniques are not appropriate for SOC design 4. Hand-coded HDL with proper verification cannot produce 100M-gate SOCs in any realistic schedule 5. The years of ever-increasing clock rate are behind us 6. Use of multiple processors and high-interconnectivity system topologies is essential to fully exploiting nanometer silicon