A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators

Yunsup Lee (1), Andrew Waterman (1), Rimas Avizienis (1), Henry Cook (1), Chen Sun (1,2), Vladimir Stojanovic (1,2), Krste Asanovic (1)
(1) University of California, Berkeley
(2) Massachusetts Institute of Technology

Upheaval in Computer Design
- Moore's Law (cost/transistor) over?
- Dennard scaling over (Vdd ~fixed)
- Energy efficiency constrains everything
- Incorporate specialized and heterogeneous accelerators into general-purpose processors
- Write processor generators to express a design space, and do vertically-integrated design space exploration extensively

[Figure: cost per million gates ($) and Vdd HP versus process node]
Process node                90nm    65nm    40nm    28nm    20nm    16/14nm
Cost per million gates ($)  0.0401  0.0282  0.0194  0.014   0.0142  0.0162

Sources:
[1] Why migration to 20nm bulk CMOS and 16/14nm FINFETS is not the best approach for semiconductor industry, IBS, Handel Jones, 2014.
[2] International Technology Roadmap for Semiconductors (ITRS)

3mm x 6mm Chip Fabricated in 45nm SOI
- 75M+ transistors
- Dual-Core RISC-V Processor with Vector Accelerators
- 1MB SRAM Memory Structure for Testing
- Monolithically-Integrated Silicon Photonic Links
  - Transmitter: Wade, OFC '14
  - Receiver: Georgas, VLSI '14

Chip Architecture
[Figure: chip floorplan and block diagram]
- Chip: 3mm x 6mm, with core logic and a 1MB SRAM array
- Dual-Core RISC-V Vector Processor: 2.8mm x 1.1mm
- Each core: Rocket Scalar Core, Hwacha Vector Accelerator (with VRF), 16K L1I$, 32K L1D$, 8KB L1VI$
- Other blocks: Arbiter, Coherence Hub, 1MB SRAM Array, FSB/HTIF link to FPGA

- RISC-V is a new, open, and completely free general-purpose ISA
  - Developed at UC Berkeley
- RISC-V designed to be flexible and extensible
  - Better integrate accelerators with host cores
- RISC-V software ecosystem
  - binutils, GCC, Newlib, glibc, GDB, LLVM, Linux, QEMU
  - External users contributing to ecosystem

Rocket Scalar Core
[Figure: Rocket pipeline; bypass paths omitted for simplicity]
Stages: PC, IF, ID, EX, MEM, WB
Blocks: PC Gen., ITLB, I$ Access, Inst. Decode, Int.RF, Int.EX, DTLB, D$ Access, Commit; FP.RF, FP.EX1, FP.EX2, FP.EX3; interface to Hwacha
- 64-bit 6-stage single-issue in-order pipeline
- Design minimizes impact of long clock-to-output delays of compiler-generated RAMs
- 64-entry BTB, 256-entry BHT, 2-entry RAS
- MMU supports page-based virtual memory
- IEEE 754-2008-compliant FPU
  - Supports SP, DP FMA with hw support for subnormals

ARM Cortex-A5 vs. RISC-V Rocket

Category              ARM Cortex-A5          RISC-V Rocket
ISA                   32-bit ARM v7          64-bit RISC-V v2
Architecture          Single-Issue In-Order  Single-Issue In-Order 6-stage
Performance           1.57 DMIPS/MHz         1.72 DMIPS/MHz
Process               TSMC 40GPLUS           TSMC 40GPLUS
Area w/o Caches       0.27 mm^2              0.14 mm^2
Area with 16K Caches  0.53 mm^2              0.39 mm^2
Area Efficiency       2.96 DMIPS/MHz/mm^2    4.41 DMIPS/MHz/mm^2
Frequency             >1GHz                  >1GHz
Dynamic Power         <0.08 mW/MHz           0.034 mW/MHz

- PPA reporting conditions: 85% utilization, use Dhrystone for benchmark, frequency/power at TT 0.9V 25C, all regular VT transistors
- 10% higher in DMIPS/MHz, 49% more area-efficient
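The two summary percentages on this slide can be reproduced from the table rows. The snippet below is a minimal sketch, assuming that "Area Efficiency" is Performance divided by Area with 16K Caches (this is what the quoted values are consistent with); the dictionary layout and function name are illustrative only.

```python
# Figures copied from the comparison slide: Dhrystone performance (DMIPS/MHz)
# and area with 16K caches (mm^2). Names below are illustrative assumptions.
cores = {
    "ARM Cortex-A5": {"dmips_per_mhz": 1.57, "area_mm2": 0.53},
    "RISC-V Rocket": {"dmips_per_mhz": 1.72, "area_mm2": 0.39},
}


def area_efficiency(core):
    """Return DMIPS/MHz per mm^2 for one table column."""
    return core["dmips_per_mhz"] / core["area_mm2"]


a5 = cores["ARM Cortex-A5"]
rocket = cores["RISC-V Rocket"]

# Relative advantages of Rocket over the Cortex-A5, as fractions.
perf_gain = rocket["dmips_per_mhz"] / a5["dmips_per_mhz"] - 1.0
area_eff_gain = area_efficiency(rocket) / area_efficiency(a5) - 1.0

for name, core in cores.items():
    print(f"{name}: {area_efficiency(core):.2f} DMIPS/MHz/mm^2")
print(f"Rocket DMIPS/MHz advantage: {perf_gain:.1%}")
print(f"Rocket area-efficiency advantage: {area_eff_gain:.1%}")
```

Rounded to whole percentages, the two printed advantages match the "10% higher" and "49% more area-efficient" figures quoted on the slide.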

Hwacha Vector Accelerator
[Figure: Hwacha block diagram]
Blocks: L1 VI$; Vector Issue Unit; eight banks (Bank0 to Bank7), each with bank control, a 1R1W SRAM with read and write ports, and an integer (int) unit; 64-bit Integer Multiplier; SP/DP Floating-Point Units; Vector Memory Unit; Shared L1 D$; interface from Rocket

Bank Execution Diagram
[Figure: bank execution timing diagram]
After a 2-cycle initial startup latency, the banked RF is effectively able to read out 2 operands/cycle.

Processor Generators
- Express hardware as highly parameterized generators
- Helps tune the design under different performance, power, and area constraints
- Parameters include:
  - number of cores
  - cache sizes, associativity, number of TLB entries, cache-coherence protocol
  - number of floating-point pipeline stages
  - width of off-chip I/O, and more
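To make the idea of a parameterized design space concrete, here is a minimal Python sketch. It is not the Chisel generator described in the talk: the `ProcessorConfig` class, its field names, and all candidate values are hypothetical and chosen only to mirror the parameter list above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ProcessorConfig:
    """One point in a hypothetical design space (field names are illustrative)."""
    num_cores: int
    l1d_kib: int
    l1d_ways: int
    tlb_entries: int
    fpu_stages: int
    offchip_io_bits: int


def enumerate_design_space(**choices):
    """Yield one ProcessorConfig per combination of the candidate values."""
    names = list(choices)
    for values in product(*(choices[name] for name in names)):
        yield ProcessorConfig(**dict(zip(names, values)))


# Candidate values are made up for illustration; they are not from the slides.
space = list(enumerate_design_space(
    num_cores=[1, 2],
    l1d_kib=[16, 32],
    l1d_ways=[2, 4],
    tlb_entries=[8, 16],
    fpu_stages=[3, 4],
    offchip_io_bits=[8, 16],
))
print(len(space), "design points, e.g.", space[0])
```

In a real flow each configuration would be handed to the hardware generator and evaluated for performance, power, and area; here the sketch only enumerates the points.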

Writing Generators with Chisel
- RTL generator written in Chisel
  - HDL embedded in Scala
- Full power of Scala for writing generators
  - object-oriented programming, functional programming

[Figure: Chisel tool flow]
Chisel Program -> Scala/JVM, which emits:
- C++ code -> C++ Compiler -> Software Simulator
- FPGA Verilog -> FPGA Tools -> FPGA Emulation
- ASIC Verilog -> ASIC Tools -> GDS Layout

Physical Design Flow
[Figure: physical design flow]
Chisel Source Code -> Chisel -> RTL Code (Verilog) -> Synthesis -> Place-and-Route -> Gate-level Netlist -> Signed-Off Design
Checks applied before sign-off:
- Formality: Formal Verification
- PrimeTime/StarRC: Static Timing Analysis
- VCS Post-PNR: Gate-level Simulation
The core is synthesized and place-and-routed independently, and instantiated twice.

Chip Results

Chip Parameters
Process         45nm SOI CMOS, 11 metal layers
Package         C4 area I/O, flip-chip bonded to PCB
Size            Processor    2.8mm x 1.1mm
                1 Core       1.37mm x 1.06mm
                SRAM Array   1.1mm x 4mm
Standard Cells  Processor    425K (85K flip-flops)
                1 Core       192K (36K flip-flops)
SRAM Bits       Processor    1246K
                1 Core       621K
Frequency       1GHz (Nominal), 250MHz-1.3GHz
Voltage         1V (Nominal), 0.65V-1.2V
Power           300mW-430mW (Nominal), 40mW-960mW

Measurement Setup
[Figure: test setup]
45nm RISC-V Vector Processor <-> FSB (150MHz) <-> Virtex 6 FPGA Board <-> 1Gbps Ethernet <-> Laptop
Other labels: 512MB DRAM; "only used in basic testing mode"

Shmoo Plot of DP GFLOPS/W
Running Double-Precision Matrix Multiplication on Vector Accelerator

Cell values are DP GFLOPS/W; "-" marks a point that is Not Operational.

Frequency   Vdd (V)
(MHz)       0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15  1.20
200         15.8  12.8  10.6   8.7   7.1   5.7   4.6   3.7   3.1   2.6   2.1   1.6
250         16.7  14.0  11.6   9.6   7.9   6.4   5.3   4.3   3.6   3.0   2.4   1.9
300            -  14.9  12.4  10.3   8.6   7.0   5.8   4.8   4.0   3.3   2.8   2.2
350            -  15.6  13.0  10.9   9.1   7.5   6.2   5.2   4.3   3.7   3.0   2.4
400            -     -  13.6  11.3   9.6   7.9   6.6   5.5   4.7   3.9   3.3   2.6
450            -     -  14.1  11.8   9.9   8.3   6.9   5.8   4.9   4.2   3.5   2.8
500            -     -     -  12.2  10.3   8.6   7.2   6.1   5.2   4.4   3.7   3.0
550            -     -     -  12.5  10.5   8.8   7.5   6.3   5.4   4.5   3.8   3.1
600            -     -     -     -  10.8   9.1   7.7   6.5   5.6   4.7   4.0   3.3
650            -     -     -     -  11.0   9.3   7.9   6.7   5.8   4.9   4.2   3.3
700            -     -     -     -  11.2   9.5   8.1   6.9   5.9   5.0   4.3   3.5
750            -     -     -     -  11.4   9.7   8.3   7.1   6.1   5.2   4.4   3.6
800            -     -     -     -     -   9.8   8.4   7.2   6.2   5.3   4.5   3.6
850            -     -     -     -     -     -   8.6   7.4   6.4   5.4   4.6   3.7
900            -     -     -     -     -     -   8.7   7.5   6.5   5.5   4.7   3.8
950            -     -     -     -     -     -   8.8   7.6   6.6   5.6   4.8   3.9
1000           -     -     -     -     -     -     -   7.3   6.6   5.7   4.8   4.0
1050           -     -     -     -     -     -     -     -   6.7   5.7   4.9   4.0
1100           -     -     -     -     -     -     -     -     -   5.8   5.0   4.1
1150           -     -     -     -     -     -     -     -     -   5.9   5.0   4.2
1200           -     -     -     -     -     -     -     -     -     -   5.1   4.2
1250           -     -     -     -     -     -     -     -     -     -   5.1   4.3
1300           -     -     -     -     -     -     -     -     -     -     -   4.2
1350           -     -     -     -     -     -     -     -     -     -     -     -

Annotated operating points:
- Most Efficient: 250MHz@0.65V, 16.7 GFLOPS/W
- VDD at 0.8V: 550MHz@0.8V, 12.5 GFLOPS/W
- Nominal: 1GHz@1V, 7.3 GFLOPS/W
- Max Frequency: 1.3GHz@1.2V, 4.2 GFLOPS/W
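An efficiency in GFLOPS/W can also be read as energy per operation, which makes the spread between operating points easier to compare. The sketch below uses only the four annotated operating points from this slide; the conversion relies on the unit identity that 1 GFLOPS/W equals 10^9 FLOP per joule, and the function and variable names are illustrative.

```python
# Annotated operating points from the shmoo plot:
# (label, frequency in MHz, Vdd in volts, efficiency in DP GFLOPS/W).
operating_points = [
    ("Most efficient", 250, 0.65, 16.7),
    ("VDD at 0.8V", 550, 0.80, 12.5),
    ("Nominal", 1000, 1.00, 7.3),
    ("Max frequency", 1300, 1.20, 4.2),
]


def picojoules_per_flop(gflops_per_watt):
    """1 GFLOPS/W is 1e9 FLOP per joule, so energy per FLOP is 1000 pJ / value."""
    return 1000.0 / gflops_per_watt


nominal_eff = next(eff for label, _, _, eff in operating_points if label == "Nominal")

for label, mhz, vdd, eff in operating_points:
    print(
        f"{label}: {mhz} MHz @ {vdd:.2f} V, "
        f"{picojoules_per_flop(eff):.0f} pJ/FLOP, "
        f"{eff / nominal_eff:.2f}x nominal efficiency"
    )
```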

Energy Efficiency Comparison

@0.8V        Frequency (GHz)  64-bit GFLOPS  Power (W)  Efficiency (GFLOPS/W)
Blue Gene/Q  1.60             204.8          29.7       6.9
IBM Cell     3.20             108.8          22.5       4.8
This Work    0.55             1.72           0.138      12.5

- BG/Q and IBM Cell fabricated in same 45nm SOI
- Conservatively assume BG/Q and Cell achieve peak GFLOPS; we achieve 78% of peak GFLOPS
- Power numbers only for the core with private caches
  - Blue Gene/Q: Cores dissipate 54% of total power
  - IBM Cell: Assume that cores dissipate 50% of total power
- Why better energy efficiency than others?
  - Simpler, yet more energy-efficient, microarchitecture
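The efficiency column follows directly from the GFLOPS and power columns, and the 78% figure implies a peak throughput for this work. The sketch below recomputes both from the table values; the dictionary layout and names are illustrative, and the implied peak is a derived number, not one quoted on the slide.

```python
# (64-bit GFLOPS, power in W) per row, copied from the comparison table.
rows = {
    "Blue Gene/Q": (204.8, 29.7),
    "IBM Cell": (108.8, 22.5),
    "This Work": (1.72, 0.138),
}


def efficiency(gflops, watts):
    """Energy efficiency in GFLOPS/W."""
    return gflops / watts


eff = {name: efficiency(*values) for name, values in rows.items()}

# The slide states that this work achieves 78% of its peak GFLOPS.
achieved_fraction = 0.78
implied_peak_gflops = rows["This Work"][0] / achieved_fraction

for name, value in eff.items():
    print(f"{name}: {value:.1f} GFLOPS/W")
print(f"Advantage over Blue Gene/Q: {eff['This Work'] / eff['Blue Gene/Q']:.1f}x")
print(f"Advantage over IBM Cell: {eff['This Work'] / eff['IBM Cell']:.1f}x")
print(f"Implied peak for this work: {implied_peak_gflops:.2f} GFLOPS")
```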

More on Comparison
- But BG/Q is clocked 3X faster and Cell is 6X faster?
  - If the end goal is to provide better energy efficiency then use simpler microarchitectures and rely on parallelism for performance.
- But BG/Q and Cell have big on-chip caches? What about I/O power?
  - We only count the power dissipated in the core and the private L1 caches.
- But BG/Q and Cell have 100X more total GFLOPS!
  - Sorry, we only had budget for a small test chip.

Conclusions
- Processor generators written in high-level languages can produce energy-efficient, high-performance hardware
- Our dual-core RISC-V vector processor achieves 16.7 DP GFLOPS/W at 0.65 V and a maximum frequency of 1.3 GHz at 1.2 V
- Open-source RISC-V ISA can serve as a competitive base ISA for integrating specialized heterogeneous accelerators
- Rocket chip generator and software tools open-sourced at http://riscv.org

Acknowledgment
- DARPA award HR0011-11-C-0100
- DARPA award HR0011-12-2-0016
- Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO
- NVIDIA graduate fellowship
- ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung
- All POEM team members at MIT, UC Berkeley, CU Boulder