Markets Demanding More Performance

Similar documents
Tile Processor (TILEPro64)

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Mission-Critical Space Software For Multi- Core Processors

The Multikernel A new OS architecture for scalable multicore systems

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools

Software Defined Modem A commercial platform for wireless handsets

A unified multicore programming model

S2C K7 Prodigy Logic Module Series

A 400Gbps Multi-Core Network Processor

ARISTA: Improving Application Performance While Reducing Complexity

Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Performance study example ( 5.3) Performance study example

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

CCIX: a new coherent multichip interconnect for accelerated use cases

MULTIPROCESSOR OS. Overview. COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS. Multiprocessor OS

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

The S6000 Family of Processors

TILE PROCESSOR APPLICATION BINARY INTERFACE

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India

CSE502: Computer Architecture CSE 502: Computer Architecture

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Creating a Scalable Microprocessor:

OCP Engineering Workshop - Telco

EN105 : Computer architecture. Course overview J. CRENNE 2015/2016

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

INT G bit TCP Offload Engine SOC

COSC 6385 Computer Architecture - Multi Processor Systems

Parallel Computing: Parallel Architectures Jin, Hai

RZ/G1 SeRieS embedded microprocessors

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

IMPLICIT+EXPLICIT Architecture

2-Port 10G Fiber Network Card with Open SFP+ - PCIe, Intel Chip

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

Building and Using the ATLAS Transactional Memory System

QuickSpecs. HP Z 10GbE Dual Port Module. Models

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

QuickSpecs. Integrated NC7782 Gigabit Dual Port PCI-X LOM. Overview

White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices

Agile Hardware Design: Building Chips with Small Teams

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

Hybrid Memory Cube (HMC)

Freescale QorIQ Program Overview

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

XMOS Technology Whitepaper

Combining Arm & RISC-V in Heterogeneous Designs

Octopus: A Multi-core implementation

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Netronome NFP: Theory of Operation

Veloce2 the Enterprise Verification Platform. Simon Chen Emulation Business Development Director Mentor Graphics

Challenges for Next Generation Networking AMP Series

Massively Parallel Processor Breadboarding (MPPB)

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

COSC 6374 Parallel Computation. Parallel Computer Architectures

An Intelligent NIC Design Xin Song

Bifrost - The GPU architecture for next five billion

Microsoft SQL Server 2012 Fast Track Reference Configuration Using PowerEdge R720 and EqualLogic PS6110XV Arrays

Projects on the Intel Single-chip Cloud Computer (SCC)

EPYC VIDEO CUG 2018 MAY 2018

Spartan-6 & Virtex-6 FPGA Connectivity Kit FAQ

Cisco Ultra Packet Core High Performance AND Features. Aeneas Dodd-Noble, Principal Engineer Daniel Walton, Director of Engineering October 18, 2018

Intelop. *As new IP blocks become available, please contact the factory for the latest updated info.

Sophon SC1 White Paper

RapidIO.org Update. Mar RapidIO.org 1

Cisco 4000 Series Integrated Services Routers: Architecture for Branch-Office Agility

The Mont-Blanc approach towards Exascale

Programming Multicore Systems

Godson Processor and its Application in High Performance Computers

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

Itanium 2 Processor Microarchitecture Overview

Topic & Scope. Content: The course gives

Lecture 20: Distributed Memory Parallelism. William Gropp

BlueGene/L. Computer Science, University of Warwick. Source: IBM

COSC 6374 Parallel Computation. Parallel Computer Architectures

4. Networks. in parallel computers. Advances in Computer Architecture

Multimedia in Mobile Phones. Architectures and Trends Lund

1-Port 10G SFP+ Fiber Optic Network Card - PCIe - Intel Chip - MM

Part IV: 3D WiNoC Architectures

Versal: AI Engine & Programming Environment

High-Performance, Highly Secure Networking for Industrial and IoT Applications

Nighthawk AX8/8-stream AX6000 WiFi Router

An FPGA-Based Optical IOH Architecture for Embedded System

Each Milliwatt Matters

Building blocks for 64-bit Systems Development of System IP in ARM

Chapter 1 Computer System Overview

Simplify System Complexity

Simplify System Complexity

Software Development Using Full System Simulation with Freescale QorIQ Communications Processors

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation

An Oracle White Paper February Enabling End-to-End 10 Gigabit Ethernet in Oracle s Sun Netra ATCA Product Family

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

Transcription:

The Tile Processor Architecture: Embedded Multicore for Networking and Digital Multimedia Tilera Corporation August 20 th 2007 Hotchips 2007 Markets Demanding More Performance Networking market - Demand for high performance - Services being integrated in the infrastructure - Faster speeds 1Gbps» 2Gbps» 4Gpbs» 10 Gbps - Demand for more services - In-line L4 L7 services, intelligence everywhere - Integration of video with networking Digital Multimedia market - Demand for high performance - H.264 encoding for High Definition - Pre & post processing - Demand for more services - VoD, video conferencing, transcoding, transrating Switches Security Appliances Routers Cable & Broadcast Video Conferencing Surveillance DVR and with power efficiency and programming ease 2

Industry Aggressively Embracing Multicore n s Performance Quad s 16 s Inherent architectural bottlenecks: No scalability Power inefficiency Primitive programming model Dual s 2006 2007 2010 Time 3 Tiled Multicore Closes the Performance Gap s connected by mesh network Unlike buses, meshes scale Resources are distributed improved power efficiency Modular easy to layout and verify Processor S Current Bus Architecture + Switch = Tile 4

Introducing the TILE64 Processor Multicore Performance (90nm) Number of tiles (general purpose cores) On chip distributed cache Operations @ 750MHz (32, 16, 8 bit) On chip interconnect bandwidth Bisection bandwidth 64 5 MB 144-192-384 BOPS 32 Terabits per second 2 Terabits per second Power Efficiency Power per tile Clock speed 170 300 mw 600-1000 MHz I/O and Memory Bandwidth I/O bandwidth Main Memory bandwidth 40 Gbps 200 Gbps Programming The TILE64 chip is shipping today ANSI standard C SMP Linux programming Stream programming 5 TILE64 Processor Block Diagram A Complete System on a Chip DDR2 Memory Controller 0 DDR2 Memory Controller 1 PCIe 0 MAC PHY Serdes XAUI MAC PHY 0 Serdes PROCESSOR Reg File P2 P1 P0 CACHE L2 CACHE L1I L1D ITLB DTLB 2D DMA UART, HPI JTAG, I2C, SPI GbE 0 SWITCH MDN TDN UDN IDN STN Flexible IO GbE 1 Flexible IO PCIe 1 MAC PHY Serdes XAUI MAC PHY 1 Serdes DDR2 Memory Controller 3 DDR2 Memory Controller 2 6

Performance in Networking and Video Performance in networking 10Gbps of SNORT Complete SNORT database All SNORT pre-processors Customer s real world data Open source SNORT software base Performance in video H.264 video encode Encodes 40 CIF video streams @ 30fps Encodes two 720p HD streams @ 30fps PSNR of 35 or more Open source X264 software base Resolution Performance on a single TILE64 Processor vs. other multicore solutions Gbps 12 10 1080P 720P 8 6 4 X 2 X 0 1 11 21 31 41 51 61 1 2 Number of Tiles @ 20Mbps/stream @ 7Mbps/stream SD 8 @ 2Mbps/stream @.1Mbps CIF 40 /stream 0 20 40 Number of video streams per TILE64 processor 7 Key Innovations 1. imesh Network How to scale 2. General purpose cores How to balance core size & number of cores PROCESSOR Reg File P2 P1 P0 MDN TDN UDN IDN STN SWITCH CACHE L2 CACHE L1I L1D ITLB DTLB 2D DMA 5. Multicore Development Environment How to program 3. Multicore Coherent Cache How to obtain both cache capacity and locality PROCESSOR CACHE PROCESSOR CACHE SWITCH SWITCH 4. Multicore Hardwall How to virtualize multicore 8

1- imesh On-Chip Network Architecture SWITCH Distributed resources 2D Mesh Peer-to-peer tile networks 5 independent networks Each with 32-bit channels, full duplex Tile-to-memory, tile-to-tile, and tile-to-io data transfer Packet switched, wormhole routed, point-to-point Near-neighbor flow control Dimension-ordered routing Performance 5 independent networks MDN TDN UDN IDN STN ASIC-like one cycle hop latency 2 Tbps bisection bandwidth 32 Tbps interconnect bandwidth One static, four dynamic IDN System and I/O MDN Cache misses, DMA, other memory TDN Tile to tile memory access UDN, STN User-level streaming and scalar transfer 9 Meshes are Power Efficient [Konstantakopoulos 07] More than 80% power savings over buses 10

Direct User Access to Interconnect Enables stream programming model Compute and send in one instruction Automatic demultiplexing of streams into registers Number of streams is virtualized Streams do not necessarily go through memory for power efficiency add r55, r3, r4 sub r5, r3, r55 Catch all TAG Dynamic Switch Dynamic Switch 11 2- Full-Featured General Purpose s Processor Homogeneous cores 3-way VLIW CPU, 64-bit instruction size SIMD instructions: 32, 16, and 8-bit ops Instructions for video (e.g., SAD) and networking (e.g., hashing) Protection and interrupts Memory L1 cache: 8KB I, 8KB D, 1 cycle latency L2 cache: 64KB unified, 7 cycle latency Off-chip main memory, ~70 cycle latency 32-bit virtual address space per process 64-bit physical address space Instruction and data TLBs Cache integrated 2D DMA engine Switch in each tile Runs SMP Linux 7 BOPS/watt PROCESSOR Reg File P2 P1 SWITCH P0 MDN UDN STN TDN IDN CACHE L2 CACHE L1I L1D ITLB DTLB 2D DMA 12

How to Size a KILL Rule for Multicore Kill If Less than Linear Increase resource size within a core only if for every 1% increase in core area there is at least a 1% increase in core performance Insight: For parallel applications, multicore performance can increase in proportion to the increase in area as more cores are added Leads to power-efficient multicore design 13 E.g., Kill Rule for Cache Size in Video Codec Multicore 2% increase Multicore 4% increase Multicore Area=1 IPC=0.04 512B Cache Area=1.03 IPC=0.17 100 s 97 s 93 s Chip IPC=4 Chip IPC=17 Chip IPC=23 Multicore Area=1.15 IPC=0.29 8KB 325% increase 14% increase 7% increase 7% increase Multicore Area=1.31 IPC=0.31 16KB 2KB 47% increase 16% increase Area=1.07 IPC=0.25 Multicore Area=1.63 IPC=0.32 32KB 4KB 87 s 76 s 61 s Chip IPC=25 Chip IPC=24 Chip IPC=19 14

3- Distributed Coherent Caching L2 il1 dl1 Each tile has local L1 and L2 caches Combined L2 caches of all tiles act as distributed 4MB L3 cache Low Latency of local L1 and L2 caches Capacity of large distributed L3 cache Caches are coherent, enabling running SMP Linux L3 15 4- Multicore Hardwall Technology for Protection and Virtualization The protection and virtualization challenge Multicore interactions make traditional architectures hard to debug and protect Memory based protection will not work with direct IO interfaces and messaging Multiple OS s and applications exacerbate this problem Multicore Hardwall technology Protects applications and OS by prohibiting unwanted interactions Configurable to include one or many tiles in a protected area OS1/APP1 OS2/APP2 OS1/APP3 16

Multicore Hardwall Implementation OS1/APP1 Switch data valid OS2/APP2 OS1/APP3 HARDWALL_ENABLE Switch data valid 17 5- Multicore Software Tools and Programming Arguably biggest multicore challenge Multicore software tools challenge Current tools are primitive use single process based models E.g., how do you single-step an app spread over many cores Many multicore vendors do not even supply tools Multicore programming challenge Key tension between getting up and running quickly using familiar models, while providing means to obtain full multicore performance How do you program 100 1000 cores? Intel Webinar likens threads to the Assembly of parallel programming but familiar and still useful in the short term for small numbers of cores Need a way to transition smoothly from today s programming to tomorrow s 18

Tilera s Approach to Multicore Tools: Spatial Views and Collectives Grid view Provides spatial view For selecting single process or region Eclipse based Multicore Debugger GDB standard based -- familiar Aggregate control and state display Whole-application model for collective control Low skid breakpointing of all related processes Multicore Profiler Collective stats Aggregate over selected tiles 19 Gentle Slope Programming Model Gentle slope programming philosophy Facilitates immediate results using off-the-shelf code Incremental steps to reach performance goals Three incremental steps Compile and run standard C applications on a single tile Run the program in parallel using standard SMP Linux models pthreads or processes Use stream programming using ilib a light-weight sockets-like API T T T M E M M E M 20

High Performance in Small Form Factor Example System Design Market moving to intelligent network systems Growing need for in-line L4- L7 services Real-time protection against threats Tile Processor enables Integrated in-line performance -- multiple apps at 1 to 10 Gbps of performance Glueless interface with leading switch vendors RJ45 Magnetics PORT1 PORT8 PORT1 PORT8 PORT1 PORT8 PORT Octal PHY Octal PHY Octal PHY 10 Gig PHY (stack) XAUI PHY RST PHY INT 24 Port L2/L3 Ethernet Switch RST INT Intelligent Switch Design Reset/Interrupt Controller XAUI XAUI UPLINK 21 Tilera: World Class Company 64 people, proven veteran team 150+ total tape-outs for revenue 15 years average experience Proven leadership team Over 150 years combined experience 7 start-up companies founded 40+ patents pending Bessemer, Walden, Columbia, VTA (TSMC) Series B funding closed in February 07 Rich Heritage Industry Involvement One of top 60 emerging startups Named finalist for Red Herring 100 22

Summary Current multicores face software and scalability challenges imesh network based Tile Processor scales to many cores Gentle slope programming offers: Convenience of SMP Linux/pthreads programming model Performance scalability through streaming channels TILE64 silicon, software tools, and applications deployed in customers systems 23 Additional Information PSNR: MDN: UDN: TDN: IDN: STN: Peak signal to noise ratio Memory dynamic network User dynamic network Tile dynamic network I/O dynamic network Static network The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, ilib, imesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners. Copyright 2007 Tilera Corporation 24