Ascenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005

Similar documents
Digital Signal Processor Core Technology

C-Based Hardware Design Platform for Dynamically Reconfigurable Processor

Math 96--Radicals #1-- Simplify; Combine--page 1

FABRICATION TECHNOLOGIES

XPU A Programmable FPGA Accelerator for Diverse Workloads

MIPS Technologies MIPS32 M4K Synthesizable Processor Core By the staff of

System-on Solution from Altera and Xilinx

An introduction to Digital Signal Processors (DSP) Using the C55xx family

D. Richard Brown III Associate Professor Worcester Polytechnic Institute Electrical and Computer Engineering Department

FPGA Polyphase Filter Bank Study & Implementation

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

General Purpose Signal Processors

High performance, power-efficient DSPs based on the TI C64x

B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes

REAL TIME DIGITAL SIGNAL PROCESSING

Jason Manley. Internal presentation: Operation overview and drill-down October 2007

The S6000 Family of Processors

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING

THE NVIDIA DEEP LEARNING ACCELERATOR

PACE: Power-Aware Computing Engines

The Xilinx XC6200 chip, the software tools and the board development tools

The extreme Adaptive DSP Solution to Sensor Data Processing

Graduate Institute of Electronics Engineering, NTU 9/16/2004

CNP: An FPGA-based Processor for Convolutional Networks

Cache Justification for Digital Signal Processors

Better sharc data such as vliw format, number of kind of functional units

High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS

Embedded Target for TI C6000 DSP 2.0 Release Notes

Understanding Peak Floating-Point Performance Claims

asoc: : A Scalable On-Chip Communication Architecture

USING C-TO-HARDWARE ACCELERATION IN FPGAS FOR WAVEFORM BASEBAND PROCESSING

FlexRIO. FPGAs Bringing Custom Functionality to Instruments. Ravichandran Raghavan Technical Marketing Engineer. ni.com

Enhancing Energy Efficiency of Processor-Based Embedded Systems thorough Post-Fabrication ISA Extension

The Nios II Family of Configurable Soft-core Processors

Embedded Controller Design. CompE 270 Digital Systems - 5. Objective. Application Specific Chips. User Programmable Logic. Copyright 1998 Ken Arnold 1

Industrial Control. 50 to 5,000 VA. Applications. Specifications. Standards. Options and Accessories. Features, Functions, Benefits.

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One

Industrial Control. 50 to 5,000 VA. Applications. Specifications. Standards. Options and Accessories. Features, Functions, Benefits.

Versal: AI Engine & Programming Environment

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

Embedded Systems. 7. System Components

Design Space Exploration for Memory Subsystems of VLIW Architectures

Classification of Semiconductor LSI

Agenda. How can we improve productivity? C++ Bit-accurate datatypes and modeling Using C++ for hardware design

14K Mixable Rings 1/ 2 ctw, $1499. Silver Diamond Pendant 1/6 ctw, $299

Stacks & Queues. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST

Memory Systems and Performance Engineering

A Reconfigurable Multifunction Computing Cache Architecture

Spiral 2-8. Cell Layout

SoC Basics Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

DSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions

REAL TIME DIGITAL SIGNAL PROCESSING

Representation of Numbers and Arithmetic in Signal Processors

Introduction to the MMAGIX Multithreading Supercomputer

EECS150 - Digital Design Lecture 09 - Parallelism

Queues. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc.

Simplifying FPGA Design for SDR with a Network on Chip Architecture

REAL TIME DIGITAL SIGNAL PROCESSING

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory

Course Overview Revisited

VICP Signal Processing Library. Further extending the performance and ease of use for VICP enabled devices

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER

ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

An Optimizing Compiler for the TMS320C25 DSP Chip

Lecture 41: Introduction to Reconfigurable Computing

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors

Specializing Hardware for Image Processing

Microprocessors vs. DSPs (ESC-223)

Two-level Reconfigurable Architecture for High-Performance Signal Processing

Dependable VLSI Platform using Robust Fabrics

Evaluating MMX Technology Using DSP and Multimedia Applications

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Qsys and IP Core Integration

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006

IMPLICIT+EXPLICIT Architecture

High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication

ECE 471 Embedded Systems Lecture 2

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Parallel Numerical Algorithms

Unifi 45 Projector Retrofit Kit for SMART Board 580 and 560 Interactive Whiteboards

Benchmarking Multithreaded, Multicore and Reconfigurable Processors

Memory Systems and Performance Engineering. Fall 2009

Computer Architecture s Changing Definition

Design of Reusable Context Pipelining for Coarse Grained Reconfigurable Architecture

Teacher Assignment and Transfer Program (TATP) On-Line Application Quick Sheets

Compilers and Code Optimization EDOARDO FUSELLA

ECE 5775 (Fall 17) High-Level Digital Design Automation. Specialized Computing

KiloCore: A 32 nm 1000-Processor Array

Microarchitecture Overview. Performance

Implementation of DSP Algorithms

This calculation converts 3562 from base 10 to base 8 octal. Digits are produced right to left, so the final answer is 6752.

TMS320C6000 Programmer s Guide

CMSC 2833 Lecture Memory Organization and Addressing

Section 8: Monomials and Radicals

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141

Transcription:

Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO robert@ascenium.com August, 2005

Ascenium: A Continuously Reconfigurable Processor Continuously reconfigurable approach provides: The computational efficiency of direct logic implementations in ASICs All the flexibility of microprocessors (pure software) Unique architecture enables the array to be effectively targeted by an ANSI Standard C compiler CPU logic is continuously redefined to match the compiler s needs for each instruction ( 20 cycles) Massive parallelism even in programs with little or no inherent instruction-level parallelism Dozens of lines of sequential C code are executed every instruction

Ascenium Block Diagram No instruction set RLU is combinatorial logic that settles Memory is accessed in statically scheduled block operations Two instruction streams 1) Memory 2) Configuration Whole loops are often executed as one RLU instruction Reconfigurable Logic Unit (RLU) Config Control In Out Memory Ascenium Memory Control

Ascenium Basic Operation 12 RISC Instructions 5 10 11 12 1 3 4 5 6 7 8 9 265 145 155 5 10 20 25 30 35 40 ALU Register File Control Reconfigurable Logic Unit (RLU) Config Control In Out Memory Control Memory Conventional Clock = 24 25 10 11 12 13 14 15 17 18 19 20 21 22 23 24 Memory Ascenium

A128 Rough Layout Pad Ring RLU Inst. Pipe Data Pipe Onchip Memory 130 nm process 128 -bit logic elements 64 MACs/clock cycle 200 MHz DDR II = 3.2 GB/s external memory bandwidth 256KB onchip memory 19.2 GB/s peak onchip memory bandwidth 500 mw @ 200 MHz 52 mm 2

Logic Element Input A ( bits) Input B ( bits) Input C ( bits) ½ -Bit Full Multiplier - Accumulator Fast Carry -Bit Add LUT ( bits) LUT Y ( bits) Wide Logic Iterator ( bits) Iterator Y ( bits) Iterator Z ( bits) Output ( bits) Output Y ( bits)

Array Connectivity

Compiler Block Diagram Library (open source) Bytecode files Source code C, C, Java GCC Based Compiler (open source) Bytecode Bytecode Linker (open source) Bytecode Proprietary Ascenium Code Generator (develop) Ascenium Executable

Code Generator Diagram LLVM PARSER GENERIC INT. REP. (GIR) GLOBAL THREADING GIR OPTS. ASCENIUM INT. REP. (AIR) FUNC. ASSIGNMENT & SCORING ARRAY INST. PACKING & COMP. DATA DIRECTIVES FETCH ORDER ASSY. AIR OPTS. OUTPUT indicates where optimizations occur

Generic Intermediate Representation (GIR) PARAM A PARAM B PARAM A LOADI PARAM A CALL RET * LOADI STORE RET RET

FFT Inner Loop C Code i1 = i0 n2; i2 = i1 n2; i3 = i2 n2; r1 = x[2 * i0] x[2 * i2]; r2 = x[2 * i0] - x[2 * i2]; t = x[2 * i1] x[2 * i3]; x[2 * i0] = r1 t; r1 = r1 - t; s1 = x[2 * i0 1] x[2 * i2 1]; s2 = x[2 * i0 1] - x[2 * i2 1]; t = x[2 * i1 1] x[2 * i3 1]; x[2 * i0 1] = s1 t; s1 = s1 - t; x[2 * i2] = (r1 * co2 s1 * si2) >>15; x[2 * i2 1] = (s1 * co2-r1*si2)>>15; t = x[2 * i1 1] - x[2 * i3 1]; r1 = r2 t; r2 = r2 - t; t = x[2 * i1] - x[2 * i3]; s1 = s2 - t; s2 = s2 t; x[2 * i1] = (r1 * co1 s1 * si1)>>15; x[2 * i1 1] = (s1 * co1-r1 * si1)>>15; x[2 * i3] = (r2 * co3 s2 * si3)>>15; x[2 * i3 1] = (s2 * co3-r2 *si3)>>15;

FFT Inner Loop Virtual Circuit 0 2 0 2 1 3 Y0 Y2 Y0 Y2 Y1 Y3 Y1 Y3 1 3 A r1 - B C r2 t1 s1 s2 D - E F - G - H t2 t3 t4 - I - J K C2 S2 C2 S2 C1 S1 C1 S1 C3 S3 C3 r3 s3 r4 r5 s4 s5 - L - M N S3 O P Q R S T U V W Y Z 32 AA 32 BB 32 CC 32 DD 32 EE 32 FF GG HH >>15 II >>15 JJ >>15 KK >>15 LL >>15 MM >>15 NN 0o Y0o 2o Y2o 1o Y1o 3o Y3o

FFT Loop Array Instruction AA BB CC DD O P R Q S T V U B A E D 0 Y0 2 Y2 C2 S2 W H C EE G F 1 Y1 3 Y3 C3 S3 NN Z Y K I C1 Y3 o MM KK FF M J S1 1 3 o o LL L N GG HH 0 Y1 o o JJ II Y0 2 o Y2 o o A 0 Y0 2 Y2 C2 S2 D C F I J GG HH B E H G K M L O 1 Y1 3 Y3 C3 S3 C1 S1 Y3 1 o 3 o o 0 Y1 o o N P Q T U AA R BB S CC V DD W Y LL II Z EE FF JJ NN KK MM Y0 o2 oy2 o

FFT Loop Data Directives 1. Fetch the array instruction and data directives 2. Load the array instruction 3. Read x[t] in four 64-bit wide banks 32 times 4. Write x[t] in one 256-bit wide bank 32 times, changing banks every 8 operations 5. Repeat steps 3 and 4 three more times 6. At the same time as 3 5, read in w coefficients 7. Pass 1: w every clock; 2: w every 4 clocks, then every, then every 64 8. Stop

DSP Benchmark Results (Simulated, Hand-Coded) Benchmark 256-Point Radix-4 FFT Real Block FIR Filter Complex Block FIR Filter DCT x 24 IIR Filter NR = 2,048 A128 core, 130nm, 250MHz, 250mW 512 ns 512 ns 512 ns 512 ns 292 ns TMS320C64x core, 130nm, 300MHz*, 250mW 8,920 ns 6,827 ns 6,827 ns 6,080 ns 6,827 ns * The 130nm C64x core will operate up to 720 MHz, but burns more power at higher frequencies.

Ascenium Status Patents filed Chip architecture complete, polishing ongoing Prototype compiler complete & demo-able Development plan in place Raising seed round

Summary Great performance and performance/mw: >10x programmable DSPs >4x performance/mw of FPGAs Great time-to-market: -- Normal GPP software development cycle Great ease-of-use: -- Programmers writing standard, non-optimized code in C/C get all the performance