Re-configurable VLIW processor for streaming data


International Workshop NGNT 97

Re-configurable VLIW processor for streaming data

V. Iossifov
Studiengang Technische Informatik, FB Ingenieurwissenschaften 1, FHTW Berlin

G. Megson
School of Computer Science, Cybernetics and Electronic Engineering, University of Reading

Abstract

This paper describes the ISA-level design of a re-configurable VLIW processor for streaming data applications with alternating data width.
- Design of the re-configurable data stream processor.
- Design of the VLIW processor for the re-configurable approach.
- Data, control and address path design of the configurable VLIW.
- Generating the FPGA code: the VLIW re-configuration procedure.
- Open problems and concluding remarks.

Keywords

Hardware Genetic Algorithm Research at RUCS, VLIW processor, the FPGA code, Streaming Data

1 The Re-configurable Computing Approach

This paper describes the ISA-level design of a re-configurable VLIW processor for streaming data applications with alternating data width. The design is based on three original designs: the Hardware Genetic Algorithm Research at RUCS, Reading [1]; the freely configurable RISC processor for streaming data applications with different data widths at FHTW Berlin [3]; and the Freedom CPU Project [5] for the host CPU.

1.1 Programmable Processors

The stored-programme processor with an instruction set architecture (ISA) is the basis of computer architecture for at least two reasons:
- It allows non-permanent customisation and application development after fabrication.
- It reuses the same active computing resources over time in order to support large computations on small amounts of processing hardware.

To make this possible, architects have continued to rely on large memories, which economically hold task descriptions and intermediate data, and on a small amount of active processing hardware that is heavily multiplexed to perform the actual computations.
The peak efficiency of an architecture for a given data format tells us what the architecture can provide when the task requirements match the architectural assumptions. If the task requires the native manipulation of small data words on a large-word machine, we obtain only a fraction of that peak.

Fig. 1. Spatial vs. temporal computation for the expression y = Ax^2 + Bx + C [2].

1.2 Re-configurable devices

Re-configurable devices can be configured after fabrication to solve any computational task. These devices are best exemplified today by FPGAs. In these re-configurable devices, tasks are implemented by spatially composing primitive operations and operators, with the possibility of changing the operator hardware over time, rather than by temporally composing instruction sequences as in Princeton-style processors. A re-configurable processor on an FPGA can perform a different operation on each bit, so re-configurable devices can be optimised to the data width of streaming data flows. The central theme of this work is to combine the advantages of non-von-Neumann architectures with the advantages of re-configurable processing elements.

2 Design of re-configurable data stream processor

2.1 Configurable general-purpose devices

Configurable architectures can perform any of a number of different operations. Once an instruction has been "configured" into the device, it is not changed while a data stream of the same data type continues. A configuration context is the collection of FPGA control bits that describes the behaviour of a general-purpose computing device during one operation cycle of a few instructions for a data stream with a defined data width. One programming stream for a conventional FPGA, containing instructions for every array element along with the interconnect, composes a "configuration context".

Integer data streams with variable data width appear in applications such as:
- Video and 3D software algorithms
- Video encoding/decoding that operates on blocks of data
- FIR filter algorithms that operate on streams of data

The re-configurable VLIW processor to be developed has to compute on integers of 8-, 16-, 32- and 64-bit data width, using dedicated register files and ALUs in parallel. The register files, internal busses and ALUs are re-configurable to the required data width.

2.2 The re-configurable streaming data approach

Streaming Data applications require maximum performance from architectures with a customised number of instructions.
This work [3] explores the possibility of making the instruction set of VLIW processors for embedded Streaming Data applications partially customisable by exploiting FPGA technology. In particular, the formal methodology presented in [4] is adapted to the custom instruction sets used for Streaming Data algorithms in order to select their computational hot spots. The novelty of the proposed method is that the Control Graph analysis of [4] is customised to a given Streaming Data application whose operands have different data widths and which is to be implemented on a re-configurable R-CPU on an FPGA. A skeleton of the proposed design flow is depicted in [3], Figure 2.

Following [4], this development focuses first on the construction of a theoretical model and of a strategy to identify the customised Streaming Data operations to be implemented on re-configurable R-CPUs with different data widths. A new op-code, denoted in [4] as the fpga-opcode, is generated correspondingly and replaces the relevant segment of computation in the translation from high-level code into machine code. The new fpga-opcode is made available to the compiler as an extension of the machine instruction set, together with the information needed for scheduling, such as the latency of the fpga-opcode. With this target architecture, the computational procedure becomes that of extracting from the application algorithm the segments of computation that are to be implemented as fpga-opcodes.

This approach, proposed in [4] and re-designed in this paper, identifies the Streaming Data instructions from the Control Graph (CG) of the application, from which suitable sub-graphs for operations with the same data width are extracted. Analysing the CG of the application algorithm identifies the Streaming Data instructions to be mapped onto the parallel R-CPUs. The aim is to identify sub-algorithms consisting of Streaming Data instructions and to map them usefully onto a dedicated R-CPU [3], [4]. The Binary Input and Unary Output (BIUO) nodes of the CG have at most two inputs and a fan-out equal to one.

2.3 Formal definition of a BIUO

A formal definition of a BIUO sub Control Graph $B_{i/j}$ is as follows. Denote by $G_i = \langle V_{i,j}; E_{i/j} \rangle$ a sub-graph, where $V_{i,j}$ is the set of nodes in $G_i$, with $i \in \{0, 1, 2\}$ input edges and $j \in \{1\}$ output edges, and $E_{i/j}$ is the set of all edges in $G_i$ departing from such nodes. An edge $e_{i/j} \in E_{i/j}$ is described by its source node ($v_{i,j} \in V_{i,j}$) and its destination node $v_{i,j} \in V_{i,j}$, and it is denoted by $e_{i/j}(k, l)$. If for all $v_{i,j} \in V_{i,j}$ it is true that $e_{i/j}(k, l) \in E_{i/j}$; $v_{i,j} \in V_{i,j}$, then $G_i$ is BIUO. Any node in $V_{i,j}$ may have incoming edges originating from nodes not belonging to $V_{i,j}$.

The above property can be used as the basis for an algorithm (described in [3]) that extracts the Streaming Data operation nodes (BIUO) from all computational hot spot nodes in the CG. The upper bound on a CG built from BIUO nodes is a binary tree, with all the topological properties of a binary tree. If $n$ is the number of operands, then $|V_{i,j}| = n - 1$ and $|E_{i/j}| = 2n - 1$.

2.4 BIUO nodes extracting lemma

Lemma 1 in [4] has to be converted for BIUO nodes as follows: all BIUOs in the CG are either BIUO or contained in a BIUO. The proof is immediate. In the following, the algorithm of [4] for the identification of all BIUOs in a CG is modified for BIUO operations and for the re-configurable PU to be generated for these operations:

    for all Node in Nodes_to_be_analysed do {
        Generate_BIUO(Node)
        Nodes_to_be_analysed -= Nodes_in_BIUO
    }

    Generate_BIUO_nodes(Node) {
        for (node_index = number_of_nodes; node_index > 0; node_index--) {
            if (fan_out == 0 && fan_in == 0)
                Generate_fixed_PU_Node
            else if (fan_out == 1 && fan_in == 1)
                Generate_BIUO_PU_Node
            else if (fan_out == 1 && fan_in == 2)
                Generate_BIUO_PU_Node
            else
                Generate_fixed_PU_Node
        }
    }

Fig. 2. Pseudo-code for the generation of all BIUO within the CG.
The algorithm operates in two steps: first, a node is chosen to be the exit node; then the program activates a function which builds the BIUO related to that exit node. Exit nodes are chosen upwards, i.e. starting from the exits of the CG. Initially, the set Nodes_to_be_analysed coincides with the set of nodes of the CG. When a BIUO has been generated, its nodes are removed from the Nodes_to_be_analysed set. The function Generate_BIUO starts from the chosen exit node and recursively tries to include its parents in the BIUO being generated. Recursion ends when the encountered node is non-legal (e.g. it is a non-Streaming Data instruction) or has a non-re-convergent fan-out. Like the algorithm proposed in [4], the proposed algorithm has a complexity that is linear in the number of nodes of the examined CG.
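The two steps described above can be illustrated with a minimal Python sketch. It is not the implementation from [3] or [4]: the graph representation (a dictionary `parents` mapping every node to the list of its operand nodes), the set `streaming` of legal Streaming Data nodes, and the names `exits_first` and `extract_biuos` are assumptions made for illustration only. The "non-re-convergent fan-out" test is also simplified here to "fan-out different from one".

```python
from collections import defaultdict


def exits_first(parents, fan_out):
    """Order nodes so that every consumer comes before its operands.

    Kahn's algorithm on the reversed graph: start from the CG exits,
    i.e. the nodes whose result nobody consumes.
    """
    pending = {node: fan_out[node] for node in parents}
    ready = [node for node in parents if pending[node] == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for p in parents[node]:
            pending[p] -= 1
            if pending[p] == 0:
                ready.append(p)
    return order


def extract_biuos(parents, streaming):
    """Group legal nodes into BIUO clusters, walking upwards from exit nodes.

    parents   -- dict: node -> list of operand nodes (every node is a key)
    streaming -- set of nodes that are Streaming Data operations
    """
    fan_out = defaultdict(int)
    for preds in parents.values():
        for p in preds:
            fan_out[p] += 1

    def legal(node):
        # A BIUO node is a streaming operation with at most two inputs.
        return node in streaming and len(parents[node]) <= 2

    remaining = set(parents)          # Nodes_to_be_analysed
    clusters = []
    for exit_node in exits_first(parents, fan_out):
        if exit_node not in remaining or not legal(exit_node):
            continue
        cluster, stack = set(), [exit_node]
        while stack:                  # iterative form of the recursion
            node = stack.pop()
            cluster.add(node)
            for p in parents[node]:
                # Stop at non-legal parents and at parents with fan-out != 1.
                if p in remaining and p not in cluster and legal(p) and fan_out[p] == 1:
                    stack.append(p)
        remaining -= cluster
        clusters.append(cluster)
    return clusters


# Control Graph of y = A*x*x + B*x + C (the expression of Fig. 1).
parents = {
    "A": [], "B": [], "C": [], "x": [],
    "m1": ["x", "x"], "m2": ["A", "m1"], "m3": ["B", "x"],
    "a1": ["m2", "m3"], "y": ["a1", "C"],
}
streaming = {"m1", "m2", "m3", "a1", "y"}
clusters = extract_biuos(parents, streaming)
print([sorted(c) for c in clusters])  # -> [['a1', 'm1', 'm2', 'm3', 'y']]
```

In this example the expression has n = 6 operand inputs (x, x, A, B, x, C), and the single extracted BIUO contains 5 = n - 1 nodes, in line with the binary-tree bound of Section 2.3. Each node and each operand edge is visited a bounded number of times, so the sketch also runs in time linear in the size of the CG.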

3 Design of VLIW processor for the re-configurable approach

3.1 Re-configurable RISC CPU for variable data widths - the calculator

The re-configurable CPU core is a two-address machine with a RISC ISA and an orthogonal GPR register file.
- Address bus width of 16 bits
- Data bus widths of 8, 16, 32 and 64 bits for the different units (ALU, GPR)

3.2 Re-configurable Systolic array - the data width sorter

The re-configurable systolic array, the data width sorter, is based on the hardware research in [1]. That research in Genetic Algorithms (GA) is centred on the development of a novel design which uses systolic arrays. The generic concept is extended by exploiting the pipeline principle to design a device that is independent of the lengths of the chromosomes used in a particular problem. The systolic arrays themselves are easily scalable to implement different population sizes. Prototype systolic array cells have been designed and targeted to the Xilinx XC4000 FPGA [1].

3.3 Re-configurable VLIW-CPU instruction set and format

The first task in designing the instruction set is to decide which instructions should join the instruction set for the data stream approach, in order to ensure ISA and EXO compatibility of the processor. Each VLIW instruction has 8 major fields:
- The Systolic sorter field controls the systolic-operation ALU and the global LOAD/STORE operations via the crossbar. The information on the streaming data type sorted onto every data output of the systolic sorter is coded as output in the FPGA Condition Code Registers of the systolic sorter.
- The R-CPUa, R-CPUb, R-CPUc and R-CPUd fields control the functions of the four R-CPUs. The R-CPU is a two-address machine.
- The FPU_memory and FPU_control fields control the 32-bit RISC Fixed Processor Unit (FPU) in performing LOAD/STORE and/or control operations [5].
- The FPGA-code field contains the FPGA-SRAM images of the RPU and systolic units.
- The VLIW control code is described in [3].

Consider, for example, the following instruction format:

    size    : 32      8, 8 free         16/24    16/24    16/24    16/24     8           6/8
    bits    : 0-31    32-47             48-63    64-79    80-95    96-111    112-119     120-127
    function: F-CPU   Systolic sorter   R-CPU    R-CPU    R-CPU    R-CPU     FPGA code   VLIW control

Fig. 3. The VLIW-CPU instruction format.

4 Data, control and address path design of the configurable VLIW

The VLIW core implements the host function for the systolic sorter and the four re-configurable R-CPU calculators. Furthermore, the VLIW core executes all ALU, control and LOAD/STORE instructions in the program that are not streaming data instructions. The task of the VLIW core is to synchronise, out of order, the operations of the R-CPUs and the systolic sorter, to execute the FPGA-code that re-configures the R-CPUs, and to invoke the LOAD/STORE operations for the systolic sorter (Fig. 4).

The crossbar between the R-CPU data registers, the main memory and the execution units is a central part of the VLIW architecture. The R-CPU data register set is read-only through this device, which virtually provides it with more than four ports. The crossbar extends the read ports of the R-CPU data register set, forming four "vertical" buses for all R-CPUs; each bus is connected to one of the input ports of the dual-port memory by "horizontal" buses. The crossbar also performs some width formatting (byte, word, etc.). Accessing an R-CPU data register takes two cycles from the time the register number has been decoded: one cycle for the register set and another for the crossbar.

Fig. 4. The VLIW-CPU architecture.

5 Generating the FPGA code - VLIW re-configurable procedure

The task of the systolic sorter is to generate a condition code for the different data widths as the result of sorting the streaming data. The compiler drives re-configuration of the FPGA prior to execution of the application code, or possibly at the beginning of every section of code that requires re-configuration. Some systolic-sorter-driven procedure designs for activating the fpga-code in the FPU are discussed in [3].

6 Open problems and concluding remarks

This paper presents the ISA-level behavioural design of a "Re-configurable VLIW processor for data streams with variable word width". The following topics remain open problems: the behavioural description of the systolic array sorter, of the data RAM, of the VLIW crossbar, and of the re-configurable data busses in the VLIW.

7 References

[1] Bland, I.M., Megson, G.M., The systolic array genetic algorithm, an example of systolic arrays as a reconfigurable design methodology, Proc. 6th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM98), IEEE Computer Society, ISBN 0-8186-8900-5, August 1998. http://www.pedal.rdg.ac.uk/pubmain.htm
[2] DeHon, Andre, Re-configurable Architectures for General-Purpose Computing, A.I. Technical Report No. 1586, M.I.T. Artificial Intelligence Lab., Oct. 1996.
[3] Iossifov, V., Megson, G.M., Re-configurable VLIW processor for data streams with variable word width, Technical report RUCS, University of Reading, July 2000. http://dozenten.f1.fhtw-berlin.de/jossifov/publikationen/
[4] Pozzi, L., Methodologies for design of Application-Specific Re-configurable VLIW Processors, PhD Thesis, Politecnico di Milano, Dip. di Elettronica e Informazione, Jan. 2000.
[5] Freedom CPU Project F-CPU: http://fcpu.tux.org/manual/summary.html#summary
[6] What Is Re-configurable Computing? http://pw1.netcom.com/~optmagic/reconfigure/whatisrc.html