Re-configurable VLIW processor for streaming data

International Workshop NGNT 97

V. Iossifov
Studiengang Technische Informatik, FB Ingenieurwissenschaften 1, FHTW Berlin

G. Megson
School of Computer Science, Cybernetics and Electronic Engineering, University of Reading

Abstract

This paper describes the ISA-level design of a re-configurable VLIW processor for streaming data applications with alternating data width. It covers the design of the re-configurable data stream processor; the design of the VLIW processor for the re-configurable approach; the data, control and address path design of the configurable VLIW; the generation of the FPGA code (the VLIW re-configuration procedure); and open problems and concluding remarks.

Keywords

Hardware Genetic Algorithm Research at RUCS, VLIW processor, FPGA code, Streaming Data

1 The Re-configurable Computing Approach

This paper describes the ISA-level design of a re-configurable VLIW processor for streaming data applications with alternating data width. The design is based on three earlier designs: the Hardware Genetic Algorithm Research at RUCS, Reading [1]; the free configurable RISC processor for streaming data applications with different data widths at FHTW Berlin [3]; and the Freedom CPU Project [5] for the host CPU.

1.1 Programmable Processors

The stored-program processor with an ISA is the basis of computer architecture for at least two reasons. First, it allows non-permanent customisation and application development after fabrication. Second, it reuses the same active computing resources over time, so that large computations can be supported on small amounts of processing hardware. To make this possible, architects have relied on large memories to hold task descriptions and intermediate data economically, and on small amounts of active processing hardware that is heavily multiplexed to perform the actual computations.

The peak efficiency of an architecture for a given data format tells us what the architecture can deliver when the task requirements match the architectural assumptions. If the task requires the native manipulation of small data words on a large-word machine, we obtain only a fraction of that peak.

Fig. 1. Spatial vs. temporal computation for the expression $y = Ax^2 + Bx + C$ [2].

1.2 Re-configurable devices

Re-configurable devices can be configured after fabrication to solve any computational task. They are best exemplified today by FPGAs. In these re-configurable devices, tasks are implemented by spatially composing primitive operations and operators, with the possibility of changing the operator hardware over time, rather than by temporally composing instruction sequences as in Princeton-style processors. A re-configurable processor on an FPGA can perform a different operation on each bit, so re-configurable devices can be optimised to the data width of streaming data flows. The central theme of this work is to combine the advantages of non-von-Neumann architectures with those of re-configurable processing elements.

2 Design of re-configurable data stream processor

2.1 Configurable general-purpose devices

Configurable architectures can perform any of a number of different operations. Once an instruction has been "configured" into the device, it is not changed while a data stream of the same data type continues. A configuration context is the collection of FPGA control bits that describes the behaviour of a general-purpose computing device during one operation cycle of a few instructions, for a data stream with a defined data width. For a conventional FPGA, one programming stream containing the instructions for every array element, together with the interconnect, constitutes a configuration context.

Integer data streams with variable data width appear in applications such as video and 3D software algorithms; video encoding/decoding that operates on blocks of data; and FIR filter algorithms that operate on a stream of data.

The re-configurable VLIW processor to be developed has to compute on integers of 8-, 16-, 32- and 64-bit data width, using dedicated register files and ALUs in parallel. The register files, internal buses and ALUs are re-configurable to the required data width.

2.2 The re-configurable streaming data approach

Streaming Data applications require maximum performance from architectures with a customised number of instructions.
This paper [3] explores the possibility of enabling partial customisation of the instruction set of VLIW processors for embedded Streaming Data applications by exploiting FPGA technology. In particular, the formal methodology presented in [4] is modified for the custom instruction sets used by Streaming Data algorithms, in order to select their computational hot spots. The novelty of the proposed method is that the Control Graph analysis of [4] is adapted to a given Streaming Data application whose operands have different data widths and which is to be implemented on a re-configurable R-CPU on an FPGA. A skeleton of the proposed design flow is depicted in [3], Figure 2.

Following [4], this development focuses first on the construction of a theoretical model and of a strategy for identifying the customised Streaming Data operations to be implemented on re-configurable R-CPUs with different data widths. A new op-code, denoted in [4] as the fpga-opcode, is generated correspondingly; it replaces the relevant segment of computation in the translation from high-level code into machine code. The new fpga-opcode is made available to the compiler as an extension of the machine instruction set, together with the information needed for scheduling, such as the latency of the fpga-opcode. With this target architecture, the computational procedure becomes that of extracting from the application algorithm the segments of computation that are to be implemented as fpga-opcodes.

This approach, proposed in [4] and re-designed in this paper, identifies the Streaming Data instructions from the Control Graph (CG) of the application, from which suitable sub-graphs for operations with the same data width are extracted. Analysing the CG of the application algorithm identifies the Streaming Data instructions to be mapped onto the parallel R-CPUs. The aim is to identify sub-algorithms consisting of Streaming Data instructions and to map them usefully onto a dedicated R-CPU [3], [4]. The Binary Input and Unary Output (BIUO) nodes of the CG have at most two inputs and a fan-out equal to one.

2.3 Formal definition of a BIUO

A formal definition of a BIUO sub Control Graph $B_{i/j}$ is as follows. Let $G_i = \langle V_{i,j}; E_{i/j} \rangle$ be a sub-graph, where $V_{i,j}$ is the set of nodes in $G_i$ with $i \in \{0, 1, 2\}$ input edges and $j \in \{1\}$ output edges, and $E_{i/j}$ is the set of all edges in $G_i$ departing from such nodes. An edge $e_{i/j} \in E_{i/j}$ is described by its source node $v_{i,j} \in V_{i,j}$ and its destination node $v_{i,j} \in V_{i,j}$, and is denoted by $e_{i/j}(k, l)$. If for all $v_{i,j} \in V_{i,j}$ it is true that $e_{i/j}(k, l) \in E_{i/j}$ and $v_{i,j} \in V_{i,j}$, then $G_i$ is a BIUO. Any node in $V_{i,j}$ may have incoming edges originating from nodes not belonging to $V_{i,j}$.

The above property can be used as the basis for an algorithm (described in [3]) that extracts the Streaming Data operation nodes (BIUO) from all computational hot-spot nodes in the CG. The upper bound on a CG built of BIUO nodes is a binary tree, with all the topological properties of a binary tree. If $n$ is the number of operands, then $|V_{i,j}| = n - 1$ and $|E_{i/j}| = 2n - 1$.

2.4 BIUO nodes extracting lemma

Lemma 1 in [4] has to be converted for BIUO nodes as follows: all BIUOs in the CG are either BIUO or contained in a BIUO. The proof is immediate. In the following, the algorithm of [4] for identifying all BIUOs in a CG is modified for BIUO operations and for the re-configurable PU to be generated for these operations:

    for all Node in Nodes_to_be_analysed do {
        Generate_BIUO(Node)
        Nodes_to_be_analysed -= Nodes_in_BIUO
    }

    Generate_BIUO_nodes(Node) {
        for (node_index = number_of_nodes; node_index > 0; node_index--) {
            if (fan-out == 0 && fan-in == 0)
                Generate_fixed_PU_Node
            else if (fan-out == 1 && fan-in == 1)
                Generate_BIUO_PU_Node
            else if (fan-out == 1 && fan-in == 2)
                Generate_BIUO_PU_Node
            else
                Generate_fixed_PU_Node
        }
    }

Fig. 2. Pseudo-code for the generation of all BIUO within the CG.
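The fan-in/fan-out test of Fig. 2 can be illustrated with a minimal Python sketch. The function name `classify_nodes`, the labels `"BIUO_PU"` and `"fixed_PU"`, and the example graph (a possible CG for the expression of Fig. 1, with coefficients treated as immediates) are assumptions made for illustration only; they are not taken from [3] or [4].

```python
from collections import defaultdict


def classify_nodes(nodes, edges):
    """Label each CG node using the fan-in/fan-out test of Fig. 2."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for src, dst in edges:
        fan_out[src] += 1
        fan_in[dst] += 1

    kinds = {}
    for node in nodes:
        # Fig. 2 sends a node to a BIUO PU only when fan-out == 1 and
        # fan-in is 1 or 2; every other combination goes to the fixed PU.
        if fan_out[node] == 1 and fan_in[node] in (1, 2):
            kinds[node] = "BIUO_PU"
        else:
            kinds[node] = "fixed_PU"
    return kinds


# Hypothetical CG for y = A*x*x + B*x + C (A, B, C are immediates).
NODES = ["ld_x", "m1", "m2", "m3", "a1", "a2", "st"]
EDGES = [("ld_x", "m1"), ("ld_x", "m3"), ("m1", "m2"), ("m2", "a1"),
         ("m3", "a1"), ("a1", "a2"), ("a2", "st")]

kinds = classify_nodes(NODES, EDGES)
for name in NODES:
    print(name, kinds[name])
```

In this example the load (fan-out two) and the store (fan-out zero) fall through to the fixed PU, while the five arithmetic nodes are candidates for a BIUO PU.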
The algorithm operates in two steps: first, a node is chosen to be the exit node; then the program activates a function which builds the BIUO related to that exit node. Exit nodes are chosen upwards, i.e. starting from the exits of the CG. Initially, the set Nodes_to_be_analysed coincides with the set of nodes of the CG. When a BIUO has been generated, its nodes are removed from the Nodes_to_be_analysed set. The function Generate_BIUO starts from the chosen exit node and recursively tries to include its parents in the BIUO being generated. Recursion ends when the encountered node is non-legal (e.g. it is a non-Streaming Data instruction) or has a non-re-convergent fan-out. Like the algorithm proposed in [4], the proposed algorithm has a complexity that is linear in the number of nodes of the examined CG.
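The two-step procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm of [3] or [4]: the names `extract_biuos`, `legal` and `grow`, the `is_streaming` predicate and the example graph are hypothetical, and any fan-out greater than one is simply treated as non-re-convergent.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract_biuos(nodes, edges, is_streaming):
    """Grow one region per exit node, walking the CG upwards.

    Returns a dict mapping each chosen exit node to the set of nodes
    collected in its region.
    """
    parents = {n: [] for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        parents[dst].append(src)
        children[src].append(dst)

    def legal(n):
        # Assumed legality test: a streaming-data operation with at most
        # two inputs and exactly one output (fan-out > 1 ends recursion).
        return (is_streaming(n)
                and len(parents[n]) <= 2
                and len(children[n]) == 1)

    to_analyse = set(nodes)

    def grow(n, region):
        # Recursively try to include the parents of n in the region.
        region.add(n)
        for p in parents[n]:
            if p in to_analyse and p not in region and legal(p):
                grow(p, region)

    regions = {}
    # static_order() lists predecessors first, so the reversed order
    # visits the exits of the CG first and then moves upwards.
    order = list(TopologicalSorter(parents).static_order())
    for exit_node in reversed(order):
        if exit_node not in to_analyse:
            continue
        region = set()
        if is_streaming(exit_node):
            grow(exit_node, region)
        else:
            region.add(exit_node)  # non-streaming node: left to the fixed PU
        to_analyse -= region
        regions[exit_node] = region
    return regions


# Hypothetical CG for y = A*x*x + B*x + C (A, B, C are immediates).
NODES = ["ld_x", "m1", "m2", "m3", "a1", "a2", "st"]
EDGES = [("ld_x", "m1"), ("ld_x", "m3"), ("m1", "m2"), ("m2", "a1"),
         ("m3", "a1"), ("a1", "a2"), ("a2", "st")]

regions = extract_biuos(NODES, EDGES,
                        lambda n: not n.startswith(("ld", "st")))
for exit_node, members in regions.items():
    print(exit_node, sorted(members))
```

Each node joins exactly one region and each edge is inspected at most once, which is consistent with the linear complexity stated above.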

3 Design of VLIW processor for the re-configurable approach

3.1 Re-configurable RISC CPU for variable data widths - the calculator

The re-configurable CPU core is a two-address machine with a RISC ISA and an orthogonal GPR register file. It has an address bus width of 16 bit and data bus widths of 8, 16, 32 and 64 bit for the different units (ALU, GPR).

3.2 Re-configurable Systolic array - the data width sorter

The re-configurable systolic array, the data width sorter, is based on the hardware research in [1]. The research in Genetic Algorithms (GA) is centred on the development of a novel design which uses systolic arrays. The generic concept is extended by exploiting the pipeline principle to design a device that is independent of the lengths of the chromosomes being used in a particular problem. The systolic arrays themselves are easily scalable to implement different population sizes. Prototype systolic array cells have been designed and targeted to the Xilinx XC4000 FPGA [1].

3.3 Re-configurable VLIW-CPU instruction set and format

The first task in designing the instruction set is to decide which instructions join the instruction set for the data stream approach, in order to ensure ISA and EXO compatibility of the processor. Each VLIW instruction has 8 major fields:

The systolic sorter field controls the systolic operation ALU and the global LOAD/STORE operations via the crossbar. The information on the streaming data type sorted onto every data output of the systolic sorter is coded as output in the FPGA Condition Code Registers of the systolic sorter.
The R-CPUa, R-CPUb, R-CPUc and R-CPUd fields control the functions of the four R-CPUs. Each R-CPU is a two-address machine.
The FPU_memory and FPU_control fields control the 32-bit RISC Fixed Processor Unit (FPU) in performing LOAD/STORE and/or control operations [5].
The FPGA-code field contains the FPGA-SRAM images of the RPU and systolic units.
The VLIW control code is described in [3].

Consider, for example, the following instruction format:

size:      32      8, 8 free         16/24   16/24   16/24   16/24    8           6/8
bits:      0-31    32-47             48-63   64-79   80-95   96-111   112-119     120-127
function:  F-CPU   Systolic sorter   R-CPU   R-CPU   R-CPU   R-CPU    FPGA code   VLIW control

Fig. 3. The VLIW-CPU instruction format.

4 Data, control and address path design of the configurable VLIW

The VLIW core implements the host function for the systolic sorter and the four re-configurable R-CPU calculators. Furthermore, the VLIW core executes all ALU, control and LOAD/STORE instructions in the program that are not streaming data instructions. The task of the VLIW core is to synchronise, out of order, the operations of the R-CPUs and the systolic sorter, to execute the FPGA-code that re-configures the R-CPUs, and to invoke the LOAD/STORE operations for the systolic sorter (Fig. 4).

The crossbar between the R-CPU data registers, the main memory and the execution units is a central part of the VLIW architecture. The R-CPU data register set is read-only through this device, which virtually provides it with more than four ports. The crossbar extends the read ports of the R-CPU data register set, forming four "vertical" buses for all R-CPUs; each bus is connected by "horizontal" buses to one of the input ports of the dual-port memory. It also performs some width formatting (byte, word, etc.). Accessing an R-CPU data register takes two cycles from the time the register number has been decoded: one cycle for the register set and another for the crossbar.

Fig. 4. The VLIW-CPU architecture.

5 Generating the FPGA code - VLIW re-configurable procedure

The task of the systolic sorter is to generate a condition code for the different data widths as the result of sorting the streaming data. The compiler drives re-configurations of the FPGA prior to execution of the application code, or possibly at the beginning of every section of code that requires re-configuration. Some systolic-sorter-driven procedure designs for activating the fpga-code in the FPU are discussed in [3].

6 Open problems and concluding remarks

This paper presents the ISA-level behavioural design of a "Re-configurable VLIW processor for data streams with variable word width". The following topics remain open problems: the behavioural description of the systolic array sorter, of the data RAM, of the VLIW crossbar, and of the re-configurable data buses in the VLIW.

7 References

[1] Bland, I.M., Megson, G.M., The systolic array genetic algorithm, an example of systolic arrays as a reconfigurable design methodology, Proc. 6th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM98), IEEE Computer Society, ISBN 0-8186-8900-5, August 1998. http://www.pedal.rdg.ac.uk/pubmain.htm
[2] DeHon, Andre, Re-configurable Architectures for General-Purpose Computing, A.I. Technical Report No. 1586, M.I.T. Artificial Intelligence Lab., Oct. 1996.
[3] Iossifov, V., Megson, G.M., Re-configurable VLIW processor for data streams with variable word width, Technical report RUCS, University of Reading, July 2000. http://dozenten.f1.fhtw-berlin.de/jossifov/publikationen/
[4] Pozzi, L., Methodologies for design of Application-Specific Re-configurable VLIW Processors, PhD Thesis, Politecnico di Milano, Dip. di Elettronica e Informazione, Jan. 2000.
[5] Freedom CPU Project F-CPU: http://fcpu.tux.org/manual/summary.html#summary
[6] What Is Re-configurable Computing? http://pw1.netcom.com/~optmagic/reconfigure/whatisrc.html