LACORE: A RISC-V BASED LINEAR ALGEBRA ACCELERATOR FOR SOC DESIGNS

Samuel Steffl and Sherief Reda
Brown University, Department of Computer Engineering
Partially funded by NSF grant 1438958
Published as "LACore: A Supercomputing-Like Linear Algebra Accelerator for SoC-Based Designs" at the IEEE Conference on Computer Design, 2017

Lightweight Linear Algebra in HPC

- Linear algebra is a foundation of high-performance computing (HPC)
- Most HPC apps can be reduced to a handful of computation classes
- Modern accelerators have numerous shortcomings:
  - Lightweight manycore (GPU): memory transfers, poor sequential mode, synchronization bottleneck
  - Fixed-function accelerators (FPGA/ASIC): costly and limited applicability

✓ Sparse Linear Algebra
✓ Dense Linear Algebra
✓ FFT
✓ Structured Grids
✓ Unstructured Grids
✓ Machine Learning
✓ Artificial Intelligence

Proposed architecture: LACore targets a wide range of applications and overcomes current hardware issues.

Design and development of the LACore consisted of:
1. Architecture and microarchitecture design
2. Instruction set (ISA) design and implementation in gcc
3. Creation of a C programming framework, called LACoreAPI
4. Implementation of the LACore in the gem5 C++ simulator
5. Evaluation of the LACore with the HPCC benchmark suite

LACore Architecture: Overview

LACore is an extension to a scalar CPU.
- Scalar CPU blocks in white
- LACore pieces in blue:
  1. LAExecUnit is a mixed-precision systolic datapath connected to the LAMemUnits with FIFOs
  2. LAMemUnits read and write data streams to the datapath FIFOs and the LACache/Scratchpad
  3. LACfgs provide configuration to the LAMemUnits
  4. 64 kB Scratchpad provides low-latency, high-throughput temporary storage
  5. 64 kB LACache with 4 ports and 16 banks is the interface to the memory hierarchy

LACore Architecture: LAExecUnit

- 4 operands: 3 inputs (A, B, C) and 1 output (D)
- Dual precision (32 and 64 bit)
- Systolic datapath: connected to the LAMemUnits through FIFOs
- Vector unit: 8 parallel VecNodes with 8 different configurations
- Reduction unit: binary tree with 7 ReduceNodes and 1 AccumulateNode, with 3 different configurations
- 24 datapath configurations

VecNode Ops            ReduceNode Ops
(A+B)*C    (A+B)/C     sum(x,y)
(A-B)*C    (A-B)/C     min(x,y)
(A*B)+C    (A*B)-C     max(x,y)
(A/B)+C    (A/B)-C
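
The configuration count on this slide can be checked with a small functional model. The sketch below is only an illustration in Python, not the LACore implementation: the names `VEC_OPS`, `REDUCE_OPS`, `tree_reduce` and `datapath` are hypothetical, the AccumulateNode is left out, and no timing, FIFO or precision behavior is modeled. It applies one VecNode operation to each of 8 lanes, folds the lane results through a binary tree, and enumerates the 8 x 3 = 24 combinations.

```python
from itertools import product

# Hypothetical tables mirroring the slide's op lists (not the real ISA encoding).
VEC_OPS = {
    "(A+B)*C": lambda a, b, c: (a + b) * c,
    "(A+B)/C": lambda a, b, c: (a + b) / c,
    "(A-B)*C": lambda a, b, c: (a - b) * c,
    "(A-B)/C": lambda a, b, c: (a - b) / c,
    "(A*B)+C": lambda a, b, c: (a * b) + c,
    "(A*B)-C": lambda a, b, c: (a * b) - c,
    "(A/B)+C": lambda a, b, c: (a / b) + c,
    "(A/B)-C": lambda a, b, c: (a / b) - c,
}
REDUCE_OPS = {
    "sum": lambda x, y: x + y,
    "min": min,
    "max": max,
}
LANES = 8  # 8 parallel VecNodes, per the slide


def tree_reduce(op, values):
    """Fold values pairwise, level by level; return (result, nodes used)."""
    level = list(values)
    nodes_used = 0
    while len(level) > 1:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        nodes_used += len(level)  # one reduce node per pair at this level
    return level[0], nodes_used


def datapath(vec_op, reduce_op, a, b, c):
    """One pass: a VecNode op on each lane, then the reduction tree."""
    if not (len(a) == len(b) == len(c) == LANES):
        raise ValueError("each operand must supply exactly 8 elements")
    lane_out = [VEC_OPS[vec_op](x, y, z) for x, y, z in zip(a, b, c)]
    result, _ = tree_reduce(REDUCE_OPS[reduce_op], lane_out)
    return result


# Every (VecNode op, ReduceNode op) pair is one datapath configuration.
CONFIGS = list(product(VEC_OPS, REDUCE_OPS))
```

With this model, `datapath("(A*B)+C", "sum", a, b, [0] * 8)` computes an 8-element slice of a dot product, and reducing 8 lane results takes 4 + 2 + 1 = 7 pairwise nodes, which matches the 7 ReduceNodes listed on the slide.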

LACore Architecture: LAMemUnit

- Handles all memory/scratchpad interactions and streams data into the LAExecUnit's datapath
- Can read or write scalars, vectors, matrices, and sparse matrices
- Handles 32-bit and 64-bit floating point
- 3 main components:
  1. Next-Addr Generator determines the location of the next item in the stream
  2. Memory Controller handles the cache and scratchpad communication protocol
  3. FIFO Controller reads and writes data to the FIFOs attached to the datapath
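
To make the Next-Addr Generator's role concrete, here is a minimal Python sketch of address generation for two of the stream types named above. Everything in it is an assumption for illustration: the function names, the parameters (`stride`, `row_pitch`, `elem_bytes`) and the row-major traversal are hypothetical and are not taken from the LACore design, which also covers scalars and sparse matrices.

```python
def vector_stream(base, count, stride=1, elem_bytes=8):
    """Yield byte addresses of `count` elements spaced `stride` elements apart.

    elem_bytes is 4 for 32-bit and 8 for 64-bit floating point.
    """
    for i in range(count):
        yield base + i * stride * elem_bytes


def matrix_stream(base, rows, cols, row_pitch, elem_bytes=8):
    """Yield byte addresses of a rows x cols block, walked row by row.

    row_pitch is the distance between row starts, in elements (assumed layout).
    """
    for r in range(rows):
        for c in range(cols):
            yield base + (r * row_pitch + c) * elem_bytes
```

For example, `list(vector_stream(0x1000, 4, stride=2, elem_bytes=4))` gives four 32-bit element addresses spaced 8 bytes apart. In the hardware described here, each generated address would then go to the Memory Controller and the returned data to the FIFO Controller.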

LACore Architecture: LACfg, LACache, Scratchpad

LACfg
- 8 LACfg configuration registers
- 294 bits per LACfg
- Connected by a crossbar switch to the LAMemUnits to provide data-stream configuration
- Hold all info about data-stream type, precision, location, etc.

LACache
- 64 kB with 128-byte line size
- 3 read ports and 1 write port
- 16 line-interleaved banks

Scratchpad
- 64 kB with 128-byte line size
- Separate address space
- 3 read ports and 1 write port
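
Line interleaving means that consecutive cache lines fall in consecutive banks, so a unit-stride stream spreads its accesses across all 16 banks. The Python sketch below works through that arithmetic for the LACache parameters on this slide. It assumes 64 kB means 65,536 bytes and that the bank index is simply the line number modulo the bank count; the actual index bits used by LACore are not given in the slides, and the names are hypothetical.

```python
CACHE_BYTES = 64 * 1024   # assumes 64 kB = 65,536 bytes
LINE_BYTES = 128
NUM_BANKS = 16

NUM_LINES = CACHE_BYTES // LINE_BYTES      # total lines in the cache
LINES_PER_BANK = NUM_LINES // NUM_BANKS    # lines held by each bank


def bank_of(addr):
    """Bank selected by a byte address under simple line interleaving."""
    line_number = addr // LINE_BYTES
    return line_number % NUM_BANKS


def offset_in_line(addr):
    """Byte offset of the address within its 128-byte line."""
    return addr % LINE_BYTES
```

Under these assumptions the cache holds 512 lines, 32 per bank, and any 16 consecutive line addresses map to 16 distinct banks, so the three read ports and one write port are less likely to collide on the same bank during streaming accesses.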

LACore Instruction Set and LACoreAPI

RISC-V ISA extension
- 68 new 32-bit LACore instructions
- Full gcc toolchain support
- 3 classes of instructions:
  - configuration
  - data-movement
  - execution

Programming framework: LACoreAPI
- C programming framework
- Contains 3 classes of APIs:
  - Configuration
  - Data-Movement
  - Execution

HPCC Evaluation in gem5

- Implemented LACore in gem5
  - Industry-standard, cycle-accurate C++ system simulator
  - ~25,000 lines of C++/Python
- Implemented the HPC Challenge (HPCC) benchmarks
  - DGEMM, FFT, PTRANS, STREAM, HPL and Random Access
- 3 other architectures evaluated for comparison
  - In-order pipelined RISC-V core
  - Superscalar x86 core with SSE2 extensions
  - Equivalent Fermi GPU with 2 Streaming Multiprocessors
- Overall HPCC benchmark suite
  - LACore outperforms x86 by an average of 3.43x
  - LACore outperforms RISC-V by an average of 10.72x
  - LACore outperforms GPU by an average of 12.04x
- LACore has competitive area utilization
  - Only 2.53x the area of a single RISC-V scalar CPU
  - Only 0.60x the area of the equivalent GPU

          DGEMM   FFT     PTRANS  HPL     Random Access  STREAM - Triad
x86       5.73x   2.4x    4.14x   2.10x   0.7x           4.3x
RISC-V    31.7x   4.23x   1.52x   4.67x   1.0x           13x
GPU       1.6x    7.98x   1.98    -       0.3x           47x

LACore is Freely Available

https://github.com/scale-lab/la-core

- In the spirit of RISC-V open hardware and open-source gem5
- Installation guide
  - riscv-tools with LACore extension
  - cross-compiling GSL for RISC-V
  - building the HPCC benchmarks for LACore, RISC-V, x86 and CUDA
  - installing and building gem5 models for LACore, RISC-V, x86 and CUDA
- How to extend the LACore codebase
- Developer docs
  - emphasis on effectively programming for the LACore ISA
  - provides full LACoreAPI usage examples

LACore Simulators and Benchmarks

https://github.com/scale-lab/la-core

- Multiple simulators available for LACore
  - available as a reference for other RISC-V-based accelerators
- riscv-isa-sim (aka "spike"): gold-standard RISC-V functional simulator
- gem5 64-bit LACore models
  - AtomicLACoreSimpleCPU: purely functional
  - TimingLACoreSimpleCPU: single-cycle instructions + cycle-accurate memory access
  - MinorLACoreCPU: in-order pipelined CPU + cycle-accurate memory access
- Other gem5 models
  - gem5-gpu CUDA model, in-order pipelined RISC-V model, out-of-order x86 model
- Full HPCC benchmarks for all platforms available
  - full DGEMM, FFT, PTRANS, STREAM, HPL and Random Access kernels
  - easily reproduce the benchmark results in this presentation
  - use as a reference for other applications targeting the LACore

Conclusions and Future Work

Conclusions
- LACore is a large-format vector accelerator for a broad range of linear algebra applications
- LACore has novel architectural features, including:
  - configurable, data-streaming LAMemUnits
  - dual-precision, configurable, systolic LAExecUnit
- A compiler toolchain, programming framework and architectural simulator were all implemented for the LACore
- The LACore significantly outperforms the x86, RISC-V and GPU for linear algebra applications
- LACore is open-source

Future work
- LACore multi-core design and evaluation
- LACore ASIC implementation and eventual tapeout

Thank you. Questions?