A Closer Look at the Epiphany IV 28nm 64-core Coprocessor
Andreas Olofsson, PEGPUM 2013

Adapteva Achieves 3 World Firsts
1. First processor company to reach 50 GFLOPS/W
2. First mobile processor with an open source OpenCL(TM) SDK
3. First semiconductor company to successfully crowdsource a project

The Future Is...
- Efficient
- Heterogeneous
- Open
- Task parallel

Grand Challenges Ahead
- Rebuild the computer ecosystem!
- Rewrite billions of lines of code!
- Retrain millions of programmers!
- Rewrite the education curriculum!

Our Vision: True Heterogeneous Computing
[Figure: a SYSTEM ON CHIP combining four "BIG" blocks, an FPGA, a GPU, analog, and 100s of small Epiphany RISCs/DSPs]

Architecture Comparison

Technology      FPGA     DSP        GPU       CPU        Manycore
Process         28nm     40nm       28nm      32nm       28nm
Programming     VHDL     OCL/C++/C  CUDA/OCL  OCL/C/C++  OCL/C/C++
Area (mm^2)     590      108        294       216        10
Chip Power (W)  40       22         135       130        2
Cores           n/a      8          16        4          64
Max GFLOPS      1500     160        3000      115        102
GHz * Cores     n/a      12         16        14.4       51.2
Compile Time    Hours    Minutes    Minutes   Minutes    Minutes
L1 Memory       6MB      512KB      2.5MB     256KB      2MB

- Efficiency is everything
- Peak performance means very little
- No magic bullet!
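The "efficiency is everything" point can be made concrete by dividing the table's peak GFLOPS by chip power and by die area. The following is a minimal sketch using only the figures in the comparison table; the label `CPU` for the 32nm column and all function names are assumptions made for illustration.

```python
# Figures copied from the Architecture Comparison table.
# The "CPU" label for the 32nm column is an assumption.
CHIPS = {
    "FPGA":     {"gflops": 1500, "watts": 40,  "area_mm2": 590},
    "DSP":      {"gflops": 160,  "watts": 22,  "area_mm2": 108},
    "GPU":      {"gflops": 3000, "watts": 135, "area_mm2": 294},
    "CPU":      {"gflops": 115,  "watts": 130, "area_mm2": 216},
    "Manycore": {"gflops": 102,  "watts": 2,   "area_mm2": 10},
}


def efficiency(chip):
    """Return (GFLOPS per watt, GFLOPS per mm^2) for one table column."""
    return chip["gflops"] / chip["watts"], chip["gflops"] / chip["area_mm2"]


def rank_by_gflops_per_watt(chips):
    """Return chip names sorted from most to least energy efficient."""
    return sorted(chips, key=lambda name: efficiency(chips[name])[0], reverse=True)


if __name__ == "__main__":
    for name in rank_by_gflops_per_watt(CHIPS):
        per_watt, per_mm2 = efficiency(CHIPS[name])
        print(f"{name:9s} {per_watt:6.2f} GFLOPS/W  {per_mm2:6.2f} GFLOPS/mm^2")
```

On these numbers the GPU has the highest peak performance, but the manycore column leads on both GFLOPS/W and GFLOPS/mm^2.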

Epiphany: Massive Task Parallelism
- Coprocessor to ARM/Intel
- 25mW per core
- C/C++ programmable

Programming Models

Model #1: Task Queue Model
- Great for up to 2 GFLOPS
- Supports standard C/C++
- Cloud on a chip
[Figure: an x86/ARM/FPGA host dispatching Task1 through Task4]

Model #2: Data Parallel Model
- OpenCL programmable
- Easy integration of C/C++
- OpenMP/MPI on the roadmap
[Figure: an x86/ARM/FPGA host running Task1]
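The difference between the two models can be sketched on an ordinary host machine. The example below is only an analogy: Python threads stand in for Epiphany cores, and the function names `run_task_queue` and `run_data_parallel` are hypothetical, not part of any Adapteva SDK. In the task queue model the host hands out independent tasks; in the data parallel model one kernel is applied to slices of a single data set.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil


def run_task_queue(tasks, num_workers=4):
    """Model #1 analogy: each (function, args) task goes to a free worker."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(func, *args) for func, args in tasks]
        # Collect results in submission order.
        return [future.result() for future in futures]


def run_data_parallel(kernel, data, num_workers=4):
    """Model #2 analogy: the same kernel runs on contiguous chunks of data."""
    chunk_size = max(1, ceil(len(data) / num_workers))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    def apply_kernel(chunk):
        return [kernel(item) for item in chunk]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves chunk order, so the output lines up with the input.
        results = pool.map(apply_kernel, chunks)
        return [value for chunk in results for value in chunk]


if __name__ == "__main__":
    print(run_task_queue([(pow, (2, 3)), (sum, ([1, 2, 3],))]))
    print(run_data_parallel(lambda x: x * x, list(range(10))))
```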

Architecture Tradeoffs

Feature                                     In   Out  Why
Single Precision Floating Point             X         Programming efficiency
64 Entry Register File                      X         GCC
Byte Addressable                            X         GCC
64-bit load/store                           X         Data movement is king
Dual Issue In-Order Scheduling
4 Bank Local Memory                         X         Multicore data movement
Interrupts, breakpoints, timers             X         Inexpensive
Hardware Cache                                   X    Too expensive, software managed
Double Precision Floating Point                  X    Overkill for initial market
VLIW                                             X    GCC, complexity
Instr: Not, Mask, Rotate, Add Carry, Ones        X    Not important
Data Type ISA Orthogonality                 X    X    Expensive. Focus was on floating point
  (signed, unsigned) & (8b, 16b, 32b)

Network On Chip Tradeoffs

Feature                                     In   Out  Why
64-bit transfer per cycle                   X         Maximize efficiency
Send address on every cycle                 X         Wires are free, simplicity
Round robin arbitration                     X         Easy, good enough
Distributed routing                         X         Scalable
Robust flow control                         X         Must have
Bidirectional mesh                          X         Well known
Address mapped packet switching             X         Ease of use
Single cycle message transfer               X         Short messaging important
3 separate networks                         X         No deadlock, QOS
Extensive QOS features                           X    Deterministic traffic
Torus wraparound connections                     X    Too expensive in CMOS
Circuit switching                                X    Not general purpose
Large routing buffers                            X    Too expensive
Multilayer mesh                                  X    Too expensive (for now)
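Two of the features kept in the design, address mapped packet switching and distributed routing on a mesh, can be illustrated together. The sketch below rests on assumptions that do not come from the slides: a 32-bit address split into a 6-bit row, a 6-bit column and a 20-bit local offset, and dimension-order routing (column first, then row) as one common way to make a purely local routing decision at every hop. The function names are hypothetical.

```python
# Assumed address layout (not stated in the slides):
# [ 6-bit row | 6-bit column | 20-bit local offset ]
ROW_BITS = 6
COL_BITS = 6
OFFSET_BITS = 20


def decode_address(address):
    """Split an address into (row, col, offset) of the destination core."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    col = (address >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (address >> (OFFSET_BITS + COL_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, offset


def route(src, dst):
    """Return the (row, col) nodes visited from src to dst on a 2D mesh.

    Each step needs only the current position and the destination,
    which is what makes the routing decision distributed.
    """
    row, col = src
    path = [(row, col)]
    while col != dst[1]:          # resolve the column first
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    while row != dst[0]:          # then resolve the row
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path


if __name__ == "__main__":
    address = (1 << 26) | (2 << 20) | 0x10   # row 1, column 2, offset 0x10
    dst_row, dst_col, offset = decode_address(address)
    print(route((0, 0), (dst_row, dst_col)), hex(offset))
```

Because the destination is carried in the address itself, a write to a remote core looks like an ordinary store to the sender, and the number of hops equals the Manhattan distance between the two cores.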

Epiphany Implementation Methodology
- Scalable
- 24 hrs from RTL to GDS
- Reusable tiles
- No long wires
- Pattern density
- Fault tolerant
- Thermal balancing
- Easy clocking
[Figure: Epiphany IV floorplan with IO pads surrounding a grid of tiles numbered 0,0 through 7,7; each tile contains LOGIC, SRAM and GLUE]

Epiphany IV Area Breakdown
[Figure: area breakdown chart with segments IO, Memory, FPU, Register File, Everything Else]
- Memory/RF/FPU >65% of silicon die
- Make every transistor count
- MIMD efficiency good enough

Epiphany IV Specifications
- 64 cores
- IEEE floating point (SP)
- 800 MHz max frequency
- 100 GFLOPS performance
- 6.4 GB/s IO BW
- 200 GB/s peak NoC BW
- 1.6 TB/s on-chip memory BW
- 25 billion messages/sec
- 2MB on-chip memory
- 10 mm^2 total silicon area in 28nm
- 2 Watt total chip power
- 324-ball 15x15mm BGA
- Sampling since July 2012
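Several of these headline figures can be cross-checked against each other with simple arithmetic. The sketch below uses the core count, clock, power, memory size and NoC bandwidth from the specification list. Two inputs are assumptions rather than figures from the slides: 2 floating-point operations per core per cycle (one fused multiply-add), and 32 bytes per core per cycle of local memory bandwidth (4 banks, each 64 bits wide).

```python
# Values taken from the specification list.
CORES = 64
CLOCK_HZ = 800e6
CHIP_POWER_W = 2.0
ON_CHIP_MEMORY_KB = 2 * 1024
NOC_BANDWIDTH_BYTES_PER_S = 200e9
NOC_MESSAGE_BYTES = 8                      # one 64-bit transfer

# Assumptions (not stated in the slides).
FLOPS_PER_CORE_PER_CYCLE = 2               # one multiply-add per cycle
MEMORY_BYTES_PER_CORE_PER_CYCLE = 4 * 8    # 4 banks x 64 bits


def derived_specs():
    """Derive chip-level figures from per-core figures."""
    peak_gflops = CORES * CLOCK_HZ * FLOPS_PER_CORE_PER_CYCLE / 1e9
    return {
        "peak_gflops": peak_gflops,
        "gflops_per_watt": peak_gflops / CHIP_POWER_W,
        "memory_bw_tb_per_s": CORES * CLOCK_HZ * MEMORY_BYTES_PER_CORE_PER_CYCLE / 1e12,
        "noc_messages_per_s": NOC_BANDWIDTH_BYTES_PER_S / NOC_MESSAGE_BYTES,
        "memory_kb_per_core": ON_CHIP_MEMORY_KB / CORES,
    }


if __name__ == "__main__":
    for name, value in derived_specs().items():
        print(f"{name}: {value:g}")
```

Under these assumptions the derived values line up with the quoted ones: roughly 100 GFLOPS, about 50 GFLOPS/W, about 1.6 TB/s of on-chip memory bandwidth, 25 billion messages per second, and 32KB of memory per core.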

Epiphany System Examples
- Programmable and flexible
- Easy to develop for
- Efficient and powerful
- Scalable
[Figure: two example systems. (1) An ARM SoC (SDIO, flash, USB, SDRAM, Ethernet, etc.) with a small FPGA and a 2GB/s eLink to E16G301 chips. (2) FPGAs carrying ADC and DAC channels (JESD204, DMA, eLink) at 4GB/s around a group of E16G301 chips.]

Parallella Computing Project
- Open (and free): documentation, board design files, drivers, software tools
- Accessible (NO NDAs!)
- $100 entry point
- ~4000 devs signed up in 4 weeks
[Figure: board layout with ZYNQ (ARM), E64, 1GB SDRAM, uSD, HDMI, RJ45, two USB ports, GPIO and IO connectors]

Parallella Architecture
[Figure: block diagram. Zynq hard logic: dual-core ARM A9 (800MHz), memory controller, and MIO peripherals (SD card, I2C, UART, Ethernet, USB OTG, USB 2.0). Zynq FPGA: AXI bus with AXI master and AXI slave ports, eLink glue logic, GPIO, HDMI controller. Off chip: DRAM divided into O/S DRAM, sandbox and shared DRAM, plus the Epiphany daughter card.]

Experimental Features On The Way
- Network traffic and congestion monitors
- Multicore hardware synchronization
- Active messages
- Multicore breakpoint
- Hardware loops
- Multicast network transactions
Already inside Epiphany III/IV silicon, but need more testing!

MIMD Manycore IS the Future!
- 2012: 1024 cores, 32KB/core, 0.2W-20W
- 2022: 16K cores, 1MB/core (3D), 0.2W-20W
[Figure: ~10mm x ~10mm die]