Automatic Generation of a Code Generator for SHARC ADSP-2106x

Peter Aronsson, Levon Saldamli, Peter Fritzson
(petar, levsa, petfr)@ida.liu.se
Dept. of Computer and Information Science
Linköping University, Sweden
August 3, 1999

1 Abstract

New DSP processors with increasingly complex instruction sets are continuously being developed. To master such complexity, it is becoming essential to quickly provide efficient high-level language compilers for these processors. This paper describes the use of new compiler generation tools (CoSy) to automatically generate a code generator for the digital signal processor SHARC ADSP-2106x from a description of its instruction set. The resulting C compiler, which generates production-quality code, was produced by two master's students in 5 months. This gives an indication of the power and flexibility of generator tools compared to traditional manual compiler implementation.

2 Introduction

This paper describes the generation and implementation of a code generator for the digital signal processor SHARC ADSP-2106x from Analog Devices Inc., using the Back End Generator tool, BEG, which is part of the CoSy compiler generation system [2][5]. CoSy is a compiler development tool developed by ACE (bv) as a spin-off product of the ESPRIT projects COMPARE and PREPARE.

New DSP processors are developed all the time, so the ability to quickly develop compilers for them is important for their acceptance. Developing an entirely new compiler for each new processor is far too expensive. Using a compiler construction tool such as CoSy brings several advantages. First, since a compiler in CoSy is built from several modules that can be reused in other compilers, development time decreases substantially. In fact, implementing a new compiler for a given DSP requires only the code generator to be written. The other modules, such as the front end, can be reused.
Another great advantage is that generators for modules, or engines as they are called in CoSy, exist for optimizers and backends. These generators produce complete, or almost complete, engines from specifications, which makes it easier to guarantee a compiler of high quality. This assumes, of course, that the generator tool itself is well tested and free of errors.

The C frontend delivered with CoSy has an optional DSP-C extension, which allows C programmers to, for instance, declare variables in different memory banks, or to declare a variable of fixed-point type, a type very common in DSP applications. It is also possible to declare an array to be circular, another common data structure in DSP software.

The Back End Generator, BEG, generates a set of engines that work on the internal representation produced by a frontend and produce the output file containing the program in the specified target language. In this case, the target language of the compiler is assembler code for the SHARC processor. BEG uses pattern matching combined with dynamic programming to translate the internal representation, which has the form of an abstract syntax tree, into assembler instructions. The internal representation is called CCMIR, which stands for Common CoSy Medium-level Intermediate Representation. Patterns are reduced to nonterminals, which can correspond to values stored in registers or to addressing modes.

BEG also generates engines for register allocation and for instruction scheduling. In the current release of CoSy these work independently of each other. This has some disadvantages, especially when the processor has a VLIW architecture, since many operations have register constraints when executing in parallel.

3 DSP-C extension

The DSP-C extensions to the C language are fully integrated in the frontend engine, which translates the program into CCMIR. The CCMIR type system supports the DSP-specific variable declarations. For instance, the code

    accum acc_val;
    fixed D signal[48];
    fixed P coeff[48];

declares a variable of type accum, which is a fixed-point number with both a fractional and an integral part. It also declares two arrays of fixed-point type, i.e. numbers with only a fractional part. The array named signal is declared to be stored in data memory, hence the D keyword. The array named coeff is declared to be stored in program memory, hence the P keyword.

The DSP-C extension also supports circular arrays, i.e. an array can be declared as circular. Indexing such an array beyond its boundary is safe because the index wraps around into the valid range. All type information is stored in the CCMIR and can be used by the backend to produce efficient code.

4 Back End Generator

BEG uses pattern matching and dynamic programming to select the best instructions for a given subtree in the CCMIR [2][4]. The Code Generator Description file (CGD file), which is the input to BEG, consists of a set of rules and nonterminals, and a description for the scheduler. Each rule has a pattern that must match for the rule to apply. If the rule applies, it can reduce the part of the tree covered by the pattern to a nonterminal. A nonterminal can be one of four types:

- Register, to represent a value stored in registers.
- Memory, to represent storage in memory.
- Addrmode, to represent an addressing mode.
- Unique, for values stored in some unique location.
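
To make the combination of pattern matching and dynamic programming more concrete, the following is a minimal sketch in C of bottom-up cost labelling over an expression tree. It is not BEG's implementation: the node kinds, the two nonterminals, the rule set and all costs are hypothetical and chosen only for illustration. Each node records the cheapest known cost of reducing its subtree to each nonterminal, and a rule with a condition (here: the right operand is the constant 1) competes with the general rule on cost.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical node kinds; they are not the actual CCMIR operators. */
enum op { OP_CONST, OP_ADDR, OP_CONTENT, OP_PLUS };

/* Two of the four nonterminal kinds listed above. */
enum nt { NT_REG, NT_ADDRMODE, NT_COUNT };

struct node {
    enum op op;
    int value;              /* only meaningful for OP_CONST */
    struct node *kid[2];    /* unused children are NULL */
    int cost[NT_COUNT];     /* cheapest known cost per nonterminal */
};

#define INF (INT_MAX / 4)   /* "no rule applies"; safe to add small costs to */

int min_int(int a, int b) { return a < b ? a : b; }

/* Bottom-up labelling: label the children first, then try every rule
 * whose pattern matches this node and keep the cheapest cost for each
 * nonterminal. All rule costs below are invented for illustration. */
void label(struct node *n)
{
    assert(n != NULL);
    for (int i = 0; i < 2; i++)
        if (n->kid[i] != NULL)
            label(n->kid[i]);

    n->cost[NT_REG] = INF;
    n->cost[NT_ADDRMODE] = INF;

    switch (n->op) {
    case OP_CONST:
        n->cost[NT_REG] = 1;            /* load immediate into a register */
        break;
    case OP_ADDR:
        n->cost[NT_ADDRMODE] = 0;       /* symbol used directly as an address */
        n->cost[NT_REG] = 1;            /* or materialised in a register */
        break;
    case OP_CONTENT:
        /* load through an addressing mode, or through an address register */
        n->cost[NT_REG] = min_int(n->kid[0]->cost[NT_ADDRMODE] + 1,
                                  n->kid[0]->cost[NT_REG] + 1);
        break;
    case OP_PLUS:
        /* general rule: register + register */
        n->cost[NT_REG] = n->kid[0]->cost[NT_REG]
                        + n->kid[1]->cost[NT_REG] + 1;
        /* conditional rule: register + constant 1, no register for the constant */
        if (n->kid[1]->op == OP_CONST && n->kid[1]->value == 1)
            n->cost[NT_REG] = min_int(n->cost[NT_REG],
                                      n->kid[0]->cost[NT_REG] + 1);
        break;
    }
}

/* Cheapest cost of computing (contents of symbol) + k into a register. */
int cost_of_load_plus_const(int k)
{
    struct node sym  = { OP_ADDR,    0, { NULL,  NULL }, { 0 } };
    struct node load = { OP_CONTENT, 0, { &sym,  NULL }, { 0 } };
    struct node c    = { OP_CONST,   k, { NULL,  NULL }, { 0 } };
    struct node plus = { OP_PLUS,    0, { &load, &c   }, { 0 } };
    label(&plus);
    return plus.cost[NT_REG];
}
```

With these invented costs, adding the constant 1 is cheaper than adding any other constant, because the conditional rule avoids loading the constant into a register. A real generator would also record which rule produced each minimum, so that a second, top-down pass can emit the selected instructions.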

The rules and nonterminals are illustrated by the following example:

    x = y + 1;

    mirassign
        mirobjectaddr x
        mirplus
            mircontent
                mirobjectaddr y
            mirintconst 1

Figure 1: The pattern matching of rules on the CCMIR tree.

The statement above can be covered by the rules as shown in Figure 1. Each area corresponds to a rule covering that specific subtree. For instance, the mirplus node can be reduced to a nonterminal that holds the value of the operation in a register. For that rule to match, the children of the mirplus node must themselves be covered by rules reducing them to nonterminals holding their values in registers. The mirplus rule looks like this:

    RULE mirplus(rs:reg, c:mirintconst) -> rd:reg;
    COND { c.value == 1 }
    COST 3;
    TEMPLATE alu;
    EMIT { emit(add1,rs,rd); }

The TEMPLATE keyword tells the scheduler which resource template the rule allocates, i.e. which resources the assembler operation needs. In this case the operation is performed in the ALU, so the rule allocates a resource template named alu. The templates are also specified in the code generator description file, which supports allocating arbitrary resources for an arbitrary number of cycles.

4.1 Instructing the Scheduler

All operations in the SHARC processor have a latency of one [1]. That means the result of every operation is available in the next instruction cycle. However, BEG supports setting a different latency for each rule/instruction. Differing latencies are common in several DSP architectures and place tighter constraints on the instruction scheduler.

Many DSP architectures have register constraints on specific operations. For instance, the SHARC ADSP-2106x can issue an operation using the multiplier and one using the ALU in the same instruction cycle [1]. However, this is only possible if the operands are taken from specific subsets of the register file. BEG supports specifying, within a rule, constraints on which registers may be used.
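
One way to picture how resource templates could drive a scheduler is a reservation table. The sketch below, in C, is an assumption-laden illustration and not BEG's scheduler. The resource names, the struct layouts and the functions can_issue and issue are all hypothetical. A template lists which units an operation holds in each cycle after issue, and an operation may be placed in a cycle only if none of those units is already reserved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical resource bits for the compute units of a DSP. */
enum {
    RES_ALU     = 1u << 0,
    RES_MUL     = 1u << 1,
    RES_SHIFTER = 1u << 2
};

/* A resource template: the units an operation occupies in each cycle
 * after issue. MAX_CYCLES and the field names are illustrative only. */
#define MAX_CYCLES 4
struct res_template {
    int cycles;                 /* number of cycles resources are held */
    unsigned use[MAX_CYCLES];   /* bitmask of units held in cycle i */
};

/* Reservation table for the schedule of one basic block. */
#define MAX_SLOTS 64
struct schedule {
    unsigned busy[MAX_SLOTS];   /* units already reserved in each cycle */
};

/* True if an operation with template t can be issued in cycle `at`
 * without colliding with resources that are already reserved. */
bool can_issue(const struct schedule *s, const struct res_template *t, int at)
{
    assert(t->cycles <= MAX_CYCLES);
    for (int i = 0; i < t->cycles; i++) {
        if (at + i >= MAX_SLOTS)
            return false;
        if (s->busy[at + i] & t->use[i])
            return false;
    }
    return true;
}

/* Reserve the resources of t starting at cycle `at`. */
void issue(struct schedule *s, const struct res_template *t, int at)
{
    assert(can_issue(s, t, at));
    for (int i = 0; i < t->cycles; i++)
        s->busy[at + i] |= t->use[i];
}
```

In this model, a template that reserves only the multiplier and one that reserves only the ALU can share a cycle, while a template that reserves every compute unit cannot share a cycle with either. The register constraints on the operands are a separate matter and are not modelled in this sketch.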
Register constraints are specified by adding the allowed registers after the nonterminal in the pattern. For instance, the rules for issuing a multiply and an ALU operation in the same cycle look like this:

    RULE [bi_multrealspec] o:mirmult (r1:reg<r0..r3>, r2:reg<r4..r7>) -> r:reg;
    TEMPLATE mulspec;
    ..

    RULE [bi_plusspec] o:mirplus (r1:reg<r8..r11>, r2:reg<r12..r15>) -> r:reg;
    TEMPLATE aluspec;
    ..

The templates for the two rules above are declared as taking up the multiplier resource and the ALU resource, respectively. Thus the two rules can be issued in the same instruction cycle. An ordinary mirplus rule has the alu template, which actually allocates all three functional units, since in general only one compute operation can be performed in a single cycle.

Since register allocation is performed before instruction scheduling, this implementation can sometimes produce slower code than it would without the register constraints. Suppose the rules above are chosen, but the register allocator needs to perform a spill in order to fulfill the constraints. This approach then produces two extra instructions, one for spilling and one for restoring the register. The best way to handle this problem would be to integrate the register allocator with the instruction scheduler. However, this is not possible in BEG without rewriting all the generated engines by hand. Another solution could be to run the backend twice for each procedure: the first run would use register constraints on the rules, and the second run would not. The backend could then be instructed to select the better result.

4.2 Implementing Post Modify Addressing Mode

The SHARC has, like several other DSPs, a specific addressing mode for updating address pointers after accessing memory. This is very efficient when sequentially accessing the values in an array, as is typically done in, for instance, a FIR filter. Some problems occur when trying to implement this in BEG. First, all expressions accessing arrays with indexes must be transformed into pointer expressions, so that the pointer can be post-incremented. Fortunately, there is an engine in the CoSy system that does this. Another problem is that the post-modify instruction actually originates from two statements in the CCMIR: the pointer increment statement and the memory access statement.
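
The effect of the array-to-pointer transformation can be illustrated at source level. The following C sketch is only an analogy: the CoSy engine works on CCMIR rather than on C source, the function names are hypothetical, and plain int is used in place of the non-standard DSP-C fixed-point types. The second function makes the two separate steps explicit, a memory access followed by a pointer increment, which is the pair a post-modify addressing mode can perform in one instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form: every access computes base + i. */
long dot_indexed(const int *signal, const int *coeff, size_t n)
{
    assert(signal != NULL && coeff != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)signal[i] * coeff[i];
    return sum;
}

/* Pointer form: each access is a load followed by a pointer increment. */
long dot_pointer(const int *signal, const int *coeff, size_t n)
{
    assert(signal != NULL && coeff != NULL);
    const int *s = signal;
    const int *c = coeff;
    long sum = 0;
    while (n-- > 0) {
        int sv = *s;    /* memory access statement */
        s++;            /* pointer increment statement */
        int cv = *c;
        c++;
        sum += (long)sv * cv;
    }
    return sum;
}
```

Both functions compute the same inner product; only the second exposes the access and the increment as adjacent statements that a backend could combine.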
BEG cannot reduce two different statements into one nonterminal, so this special case has to be handled separately. The solution was to handle them as two separate operations; if the scheduler places them in the same instruction, they are rewritten into a single operation.

5 Results

The backend was tested on a number of programs. Figure 2 gives some test results.

    C-file    # instructions         ILP (%)
                 a     b     c     a     b     c
    8q.c       238   164   223     0     3     3
    fir.c       29    14    19     0    26    12
    mov.c      165   132   176     0     7     5
    mat.c      126    83   105     0     7     2
    vss.c       42    25    32     0    12    22

Figure 2: Test results for the compiler, compared with g21k from Analog Devices. a is g21k without optimization, b is g21k with optimization, and c is our compiler with optimization. ILP is the percentage of instructions issued in parallel.

The file 8q.c is the eight queens problem. It contains some nested loops and recursive function calls. The file fir.c is a simple FIR filter. The last three files contain matrix and vector manipulations.

The tests presented here are only small examples, showing that the compiler performs almost as well as g21k, the commercial compiler from Analog Devices. Comparing the assembler files from the two compilers shows that the major difference is that g21k implements software pipelining for a set of standard loops, such as a FIR filter. This optimization is not yet available in CoSy. Runtime tests are presented in [6].

6 Conclusions

A drawback of BEG is that the scheduler only schedules per basic block. This limits the scheduler's options for packing instructions. To get a better schedule, an algorithm working on larger code segments has to be used, for instance trace scheduling [3]. However, these limitations did not affect the implemented code generator for the SHARC processor that much, mostly because the ADSP-2106x in general has only two issue slots, containing a compute operation and a move operation. If certain register constraints are fulfilled, three issue slots can be used: two compute operations, one in the ALU and one in the multiplier, can then run in the same instruction as a move operation. Some test results from the backend gave a rather low percentage of parallel instructions, typically between 5 and 30 percent.

A conclusion drawn from this work is that a code generator for a processor can be implemented in about eight to ten man-months, resulting in a compiler that produces almost as good code as a commercial compiler. Of course, a better backend supporting more optimizations can be produced if the development time is increased somewhat. Note also that the work included learning the CoSy system. This is a substantial part of the effort, since CoSy is a large system that takes a while to fully understand and master. Approximately half of the time was dedicated to learning the system. This learning process was integrated with development, with the effect that some design decisions could, in hindsight, probably have been better.
To summarize, developing a compiler using the CoSy system is far more resource-efficient and less error-prone than using conventional methods. The fact that many optimizers and DSP extensions already exist in the CoSy system makes the development time even shorter.

References

[1] Analog Devices, Inc. ADSP-2106x SHARC User's Manual. Analog Devices, Inc., first edition, 1995.
[2] Niclas Andersson and Peter Fritzson. Overview and industrial application of code generator generators. Journal of Systems and Software, 1995.
[3] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, 30(7):478-490, 1981.
[4] H. Emmelmann, F. W. Schröer, and R. Landwehr. BEG - a generator for efficient back ends. ACM SIGPLAN Notices, 24(7):227-237, 1989.
[5] Martin Alt, Uwe Assmann, and Hans van Someren. Cosy compiler phase embedding with the CoSy compiler model. In Peter A. Fritzson, editor, Compiler Construction, 1994.
[6] Peter Aronsson and Levon Saldamli. Code generator for SHARC ADSP-2106x. Master's thesis, Dept. of Computer and Information Science, Linköping University, 1999.