Tile Processor (TILEPro64)

Similar documents
Markets Demanding More Performance

Performance study example ( 5.3) Performance study example

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.

Mission-Critical Space Software For Multi- Core Processors

Projects on the Intel Single-chip Cloud Computer (SCC)

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

MULTIPROCESSOR OS. Overview. COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS. Multiprocessor OS

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

The Multikernel A new OS architecture for scalable multicore systems

TILE PROCESSOR APPLICATION BINARY INTERFACE

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Versal: AI Engine & Programming Environment

Flexible Architecture Research Machine (FARM)

Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.

Netronome NFP: Theory of Operation

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011

S2C K7 Prodigy Logic Module Series

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

Creating a Scalable Microprocessor:

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

A 400Gbps Multi-Core Network Processor

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

A Next Generation Home Access Point and Router

Sophon SC1 White Paper

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

On-Chip Debugging of Multicore Systems

SoC Platforms and CPU Cores

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

CS/ECE 217. GPU Architecture and Parallel Programming. Lecture 16: GPU within a computing system

Industry Collaboration and Innovation

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

IBM Network Processor, Development Environment and LHCb Software

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Copyright 2016 Xilinx

XPU A Programmable FPGA Accelerator for Diverse Workloads

Parallel Computing: Parallel Architectures Jin, Hai

KeyStone II. CorePac Overview

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

EMBEDDED SYSTEMS WITH ROBOTICS AND SENSORS USING ERLANG

RISC-V based core as a soft processor in FPGAs Chowdhary Musunuri Sr. Director, Solutions & Applications Microsemi

ECE 571 Advanced Microprocessor-Based Design Lecture 24

High Performance Compute Platform Based on multi-core DSP for Seismic Modeling and Imaging

An Intelligent NIC Design Xin Song

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Designing with NXP i.mx8m SoC

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

FPQ6 - MPC8313E implementation

OCP Engineering Workshop - Telco

Building blocks for 64-bit Systems Development of System IP in ARM

Agilio CX 2x40GbE with OVS-TC

6.1 Multiprocessor Computing Environment

Chapter Seven Morgan Kaufmann Publishers

Toward a unified architecture for LAN/WAN/WLAN/SAN switches and routers

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

BlazePPS (Blaze Packet Processing System) CSEE W4840 Project Design

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor

TRACE32. Product Overview

Introduction to Sitara AM437x Processors

Industry Collaboration and Innovation

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Zynq-7000 All Programmable SoC Product Overview

High Performance Packet Processing with FlexNIC

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

CEC 450 Real-Time Systems

Copyright 2017 Intel Corporation

! Readings! ! Room-level, on-chip! vs.!

Octopus: A Multi-core implementation

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

XMOS Technology Whitepaper

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

Software Driven Verification at SoC Level. Perspec System Verifier Overview

Design, Verification and Emulation of an Island-Based Network Flow Processor

QuickSpecs. HP Z 10GbE Dual Port Module. Models

SH-X3 Flexible SuperH Multi-core for High-performance and Low-power Embedded Systems

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Simultaneous Multithreading on Pentium 4

QorIQ T4 Family of Processors. Our highest performance processor family. freescale.com

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Ettus Research Update

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Simplifying FPGA Design for SDR with a Network on Chip Architecture

ECE 574 Cluster Computing Lecture 23

PE3G4TSFI35P Quad Port Fiber Gigabit Ethernet PCI Express Time Stamp Server Adapter Intel Based

COSC 6385 Computer Architecture - Multi Processor Systems

Next Generation Enterprise Solutions from ARM

Software Design Challenges for heterogenic SOC's

Product Technical Brief S3C2416 May 2008

ARMv8-A Software Development

GPUs have enormous power that is enormously difficult to use

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Transcription:

Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth (/s) Clock speed (MHz) Power Typical power -7 device (W) I/O and Memory Ethernet bandwidth PCIe interfaces DDR2 bandwidth (peak Gbps) TILEPro64 64 5.6 Yes w/ DDC 221/166 38 700, 866 17-21 2 XAUI, 2GbE 2 x 4-lanes 200 TILE64 in 2007 TILEPro64 and TILEPro36 in 2008 TILEGx36 and TILEGx100 in 2011 Flexible I/O UART JTAG SPI, I2C PCIe 0 PCIe 1 DDR2 Controller 0 DDR2 Controller 3 DDR2 Controller 1 DDR2 Controller 2 TILEPro64 Block Diagram Quanta Computer 512 cores Tilera systems XAUI 0 GbE 0 GbE 1 XAUI 1 2

Tiled Multicore Concept Scales to large numbers of cores Modular design and verify 1 tile Power efficient Short wires plus locality opts CV 2 f Chandrakasan effect, more cores at lower freq and voltage CV 2 f Processor S + = Tile 3 Remember the 3 P s? Performance challenge How to scale with number of cores mesh network Distributed scheme for cache coherence User-level network access Power efficiency challenge Distributed architecture Mesh network Local caches Programming challenge Cache coherence General purpose, full featured cores Fine-grain protection 4

Key Ideas 1. General purpose cores Standard OS and programming 2. imesh Network How to scale and be energy efficient 5. Multicore Development Environment How to program 3. Multicore Dynamic Distributed Cache How to achieve cache coherence and run standard software OS1/APP1 OS2/APP2 OS1/APP3 4. Fine grain protection How to virtualize multicore 5 1 What s in a Processor Each core is a complete computer 3-way VLIW CPU Designed for low power 200mW per core SIMD instructions: 32, 16, and 8-bit ops Instructions for video (e.g., SAD) and networking Protection and interrupts Single core performance roughly the same as a modern MIPS or ARM core Memory L1 cache: 8KB I, 8KB D, 1 cycle latency L2 cache: 64KB unified, 7 cycle latency 32-bit virtual address space per process 64-bit physical address space Instruction and data TLBs Cache integrated 2D DMA engine in each tile ISA allows direct processor access to networks Runs SMP Linux Runs off-the-shelf open-source C/C++, pthreads programs Register File Three Execution Pipelines 16K L1-I 8K L1-D Cache 64K L2 I-TLB D-TLB 2D DMA 6

2- On-Chip Networks Distributed resources 2D Mesh peer-to-peer tile networks (named imesh TM ) 6 independent networks Each with 32-bit channels, full duplex Tile-to-memory, tile-to-tile, and tile-to-io data transfer Packet switched, wormhole routed, point-to-point Near-neighbour flow control, dimension-ordered routing Performance and energy efficiency One cycle hop latency 2 Tbps bisection bandwidth 32 Tbps interconnect bandwidth Low power 6 independent networks Five dynamic IDN System and I/O MDN Cache misses, DMA, other memory TDN, VDN Tile to tile memory access and coherence UDN One static, scalar operand network STN User-level streaming and scalar transfer SWITCH MDN TDN UDN IDN VDN STN for scalability and power efficiency 7 Direct User Access to Interconnect Enables stream programming model Compute and send in one instruction: add $udno, $r3, $udni Automatic demultiplexing of streams into registers Number of streams is virtualized Streams do not necessarily go through memory for power efficiency add r55, r3, r4 sub r5, r3, r55 Catch all TAG Dynamic Dynamic 8

Mesh Power Efficiency [Konstantakopoulos 07] 80% power savings over buses 9 3 Coherent On-Chip Cache System Distributed cache Each tile has local L1 and L2 cache Aggregate of L2 serves as a globally shared L3 Dynamic Distributed Cache (DDC ) Hardware based cache coherence Hardware tracks sharers, invalidates stale copies One or multiple coherency domains Dedicated networks to manage coherency Coherent direct-to-cache I/O Header/packet delivered directly to tile caches Cache coherent delivery Significant DRAM bandwidth and latency reduction Flexible Flexible I/O I/O UART UART JTAG JTAG SPI, SPI, I2C I2C PCIe PCIe 0 PCIe PCIe 1 DDR2 DDR2 Controller 0 DDR2 DDR2 Controller 3 DDR2 DDR2 Controller 1 Completely Coherent System tile a tile b tile h DDR2 DDR2 Controller 2 XAUI XAUI 0 GbE GbE 0 GbE GbE 1 XAUI XAUI 1 10

4 Configurable Fine Grain Protection (CFP) Full Stack Linux with Hypervisor TLB Access DMA Engine User Network I/O Network Timers Key: 0 User Code 1 OS 2 Hypervisor 3 Hypervisor Debugger 11 Multicore Hardwall Technology for Virtualization and Protection The virtualization and protection challenge Multicores need to run multiple OS s and applications in cloud environments OS s must be protected from each other I/O and other shared resources must be virtualized Multicore Hardwall technology Protects applications and OS by prohibiting unwanted interactions Configurable to include one or many tiles in a protected area Supported by Tilera hypervisor running on all the tiles OS1/APP1 OS2/APP2 OS1/APP3 12

Multicore Hardwall Implementation OS1/APP1 data valid OS2/APP2 OS1/APP3 HARDWALL_ENABLE data valid 13 5 Standard Tools and Software Multicore Development Environment Standards-based tools Standard application stack Standard programming SMP Linux 2.6.26 ANSI C/C++ pthreads Integrated tools SGI compiler Standard gdb gprof Eclipse IDE Innovative tools Multicore debug Multicore profile Application layer Open source apps Standard C/C++ libs Operating System layer 64-way SMP Linux Zero Overhead Linux Bare metal environment Hypervisor layer Virtualizes hardware I/O devices drivers Load balancer Applications Linux libraries Applications libraries Operating System Linux kernel and kernel drivers Hypervisor Virtualization and high speed I/O drivers TILE hardware Tile Tile Tile Tile 14

Multiple Software Environments to Meet Diverse Needs of Embedded and Cloud Systems Standard SMP Linux Standard Linux environment with processes and threads Ideal for applications and control plane code requiring operating system services Open source applications work out of the box Standard SMP Linux with Zero Overhead Linux (ZOL) Zero Linux overhead (Eliminates OS interrupts, timer ticks, etc..) Transparent to programmer - no software change required For high performance data-plane applications not requiring OS services Bare metal environment Full control of the hardware on up to 64 tiles No operating system or hypervisor layers For embedded applications requiring fine grain control of memory, and IO Hybrid environment Using 2 or all three of the above models Each environment can be run on one or more Tiles Ideal for customers aggregating data plane and control plane code on one chip 15 Parallel Programming using Standard Models 64-way SMP Linux Single system image across all tiles Standard pthreads API pthread_create() Shared memory model by default Synchronize using mutexes and locks Packet stream Load Balancer Deep packet inspection Histogram Standard Linux processes fork(), exec() Separate address space Share memory: mmap(), mspaces Communicate: Pipes / local sockets Gentle slope programming optimizations using Linux extensions Control memory location and distribution Control thread scheduling and location New models and further optimizations using TMC library (Tile Multicore Components) PCIe PCIe 0 MAC MAC PHY PHY Serdes UART, HPI HPI JTAG, I2C, I2C, SPI SPI Flexible IO IO PCIe PCIe 1 MAC MAC PHY PHY Serdes DDR2 Memory Controller 0 DDR2 Memory Controller 3 DDR2 Memory Controller 1 HST LB DDR2 Memory Controller 2 XAUI MAC MAC PHY PHY 0 Serdes GbE GbE 0 GbE GbE 1 Flexible IO IO XAUI MAC MAC PHY PHY 1 Serdes 16

Scaling Up: TileGx100 in 2011 MiCA Misc I/O 3 PCIe Interfaces Flexible I/O MiCA Memory Controller Memory Controller Memory Controller Memory Controller mpipe Network I/O Eight 10Gb -or- 32 1Gb -or- Two 40Gb 100 general-purpose cores Runs SMP Linux Standard programming 1.25GHz 1.5GHz Full 64-bit processors 32 MBytes total cache 546 Gbps memory BW 200 Tbps imesh BW 80-120 Gbps packet I/O 80 Gbps PCIe I/O Wire-speed packet engine 120Mpps MiCA engines: 40 Gbps crypto 20 Gbps compress 17 Scaling Down: TILEGx16 MiCA UART x2, x2, USB x2, x2, JTAG, I 2 I 2 C, C, SPI SPI PCIe 2.0 2.0 4-lane PCIe 2.0 2.0 4-lane Memory Controller (DDR3) mpipe 4x GbE SGMII 10 10 GbE XAUI 4x GbE SGMII 10 10 GbE XAUI 16 Processor s 1.0 &1.25 GHz speeds Full 64-bit processors 5.2 MBytes total cache 200 Gbps memory BW 20 Tbps imesh BW 24 Gbps total packet I/O 2 ports 10GbE (XAUI) 12 ports 1GbE (SGMII) 32 Gbps PCIe I/O PCIe 2.0 2.0 4-lane Flexible I/O I/O Memory Controller (DDR3) 4x GbE SGMII Wire-speed packet engine 30Mpps MiCA engine: 10 Gbps crypto 5 Gbps compress & 5 Gbps decompress Midrange 36 core part also announced 18