Automatic Generation of a Code Generator for SHARC ADSP-2106x

Peter Aronsson, Levon Saldamli, Peter Fritzson
(petar, levsa, petfr)@ida.liu.se
Dept. of Computer and Information Science
Linköping University, Sweden
August 3, 1999

1 Abstract

New DSP processors with increasingly complex instruction sets are continuously being developed. To master such complexity, it is becoming essential to quickly provide efficient high-level language compilers for these processors. This paper describes the use of new compiler generation tools (CoSy) to automatically generate a code generator for the digital signal processor SHARC ADSP-2106x from a description of its instruction set. The resulting C compiler, which generates production-quality code, was produced by two master's students in 5 months. This gives an indication of the power and flexibility of generator tools compared to traditional manual compiler implementations.

2 Introduction

This paper describes the generation and implementation of a code generator for the digital signal processor SHARC ADSP-2106x from Analog Devices Inc., using the Back End Generator tool, BEG, which is part of the CoSy compiler generation system[2][5]. CoSy is a compiler development tool, developed by ACE (bv) as a spinoff product from the ESPRIT projects COMPARE and PREPARE.

New DSP processors are developed all the time, so being able to quickly develop a compiler for a new DSP processor is important for the acceptance of that processor. Developing an entirely new compiler for each new processor is far too expensive. Using a compiler construction tool such as CoSy brings several advantages. First, since a compiler in CoSy is built from several modules that can be reused in other compilers, the development time decreases substantially. In fact, implementing a new compiler for a certain DSP requires only that the code generator be specified. The other modules, such as the front end, can be reused.
Another great advantage is that generators for modules, or engines as they are called in CoSy, exist for optimizers and backends. These generators produce complete, or almost complete, engines from specifications. Using these generators makes it easier to guarantee a compiler of higher quality, provided, of course, that the generator tool itself is well tested and free of errors.

The C frontend delivered with CoSy has an optional DSP-C extension, which allows C programmers to, for instance, declare variables in different memory banks, or to declare a variable as a fixed point number, a type very common in DSP applications. It is also possible to declare an array to be circular, another common data structure in DSP software.

The Back End Generator, BEG, generates a set of engines that work on the internal representation produced by a frontend and produce the output file containing the program in the specified target language. In this case, the target language of the compiler is assembler code for the SHARC processor. BEG uses pattern matching combined with dynamic programming to translate the internal representation, which has the form of an abstract syntax tree, into assembler instructions. The internal representation is called CCMIR, which stands for Common CoSy Medium-level Intermediate Representation. Patterns are reduced to nonterminals, which can correspond to values stored in registers or, for example, to addressing modes.

BEG also generates engines for register allocation and for instruction scheduling. In the current release of CoSy these work independently of each other. This has some disadvantages, especially when the processor has a VLIW architecture, since many operations have register constraints when executing in parallel.

3 DSP-C extension

The DSP-C extensions to the C language are fully integrated in the frontend engine, which translates the program into CCMIR. The CCMIR type system supports the DSP-specific variable declarations. For instance, the code

    accum acc_val;
    fixed D signal[48];
    fixed P coeff[48];

declares a variable of type accum, which is a fixed point number with both a fractional and an integral part. It also declares two arrays of fixed point numbers, i.e. numbers with only a fractional part. The array named signal is declared to be stored in data memory, hence the D keyword. The array named coeff is declared to be stored in program memory, hence the P keyword.

The DSP-C extension also supports circular arrays, i.e. an array can be declared as circular. Indexing such an array beyond its boundary is safe because the index wraps around back into the valid range. All type information is stored in the CCMIR and can be used by the backend to produce efficient code.

4 Back End Generator

BEG uses pattern matching and dynamic programming to select the best instructions for a given subtree in the CCMIR[2][4]. The Code Generator Description file (CGD file), which is the input to BEG, consists of a set of rules and nonterminals, and a description for the scheduler. Each rule has a pattern that must match for the rule to apply. If the rule applies, it can reduce the part of the tree covered by the pattern to a nonterminal. A nonterminal can be one of four different types:

    Register   to represent a value stored in registers.
    Memory     to represent a storage location in memory.
    Addrmode   to represent an addressing mode.
    Unique     for values stored in some unique location.
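The cost-driven selection described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not code generated by BEG: the node kinds, the function name cost_reg, the two hypothetical rules for an addition node, and all cost values are invented for the example. Only a single register nonterminal is modelled, and the recursion computes the cheapest cover of each subtree bottom-up.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical node kinds, loosely named after CCMIR operators. */
enum kind { INTCONST, CONTENT, PLUS };

struct node {
    enum kind kind;
    int value;                 /* used by INTCONST only */
    struct node *left, *right; /* operands of PLUS, NULL otherwise */
};

/* Cheapest cost of reducing the subtree rooted at n to a register
   nonterminal. Every rule whose pattern matches is tried and the
   minimum cost is kept. All costs are assumed values. */
int cost_reg(const struct node *n)
{
    assert(n != NULL);
    switch (n->kind) {
    case INTCONST:
        return 1; /* assumed: load an immediate into a register */
    case CONTENT:
        return 2; /* assumed: load a variable from memory */
    case PLUS: {
        /* Rule A: reg + reg -> reg, cost 1 plus both operand costs. */
        int best = 1 + cost_reg(n->left) + cost_reg(n->right);
        /* Rule B: reg + constant 1 -> reg, cost 1. The constant is
           part of the pattern, so it needs no register of its own. */
        if (n->right->kind == INTCONST && n->right->value == 1) {
            int c = 1 + cost_reg(n->left);
            if (c < best)
                best = c;
        }
        return best;
    }
    }
    return INT_MAX; /* not reached for well-formed trees */
}
```

For a tree representing y + 1, Rule B wins because it avoids loading the constant into a register; for y + 2 only Rule A matches. A full selector would keep one cost per nonterminal type at every node and record the winning rule so that code can be emitted in a second pass.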
The rules and nonterminals are illustrated by the following example:

    x = y + 1;

    mirassign
      mirobjectaddr x
      mirplus
        mircontent
          mirobjectaddr y
        mirintconst 1

Figure 1: The pattern matching of rules on the CCMIR tree.

The statement above can be covered by the rules as shown in figure 1, where each area corresponds to a rule covering that specific part of the tree. For instance, the mirplus node can be reduced to a nonterminal that holds the value of the operation in a register. For that rule to match, the children of the mirplus node must in turn be covered by rules that reduce them to nonterminals holding their values in registers. The mirplus rule looks like this:

    RULE mirplus(rs:reg, c:mirintconst) -> rd:reg;
      COND { c.value == 1 }
      COST 3;
      TEMPLATE alu;
      EMIT { emit(add1,rs,rd); }

The TEMPLATE keyword tells the scheduler which resource template this rule allocates, i.e. which resources the assembler operation needs. In this case the operation is performed in the ALU, so the rule allocates a resource template named alu. The templates are also specified in the code generator description file, which supports allocating arbitrary resources for an arbitrary number of cycles.

4.1 Instructing the Scheduler

All operations in the SHARC processor have a latency of one[1]. That means the results of all operations are available in the next instruction cycle. However, BEG supports setting a different latency for each rule/instruction. Such differing latencies are common in several DSP architectures and place tighter constraints on the instruction scheduler.

Many DSP architectures have register constraints on specific operations. For instance, the SHARC ADSP-2106x can issue an operation using the multiplier and one using the ALU in the same instruction cycle[1]. This is, however, only possible if the operands are taken from specific subsets of the register file. BEG supports specifying, in a rule, constraints on which registers may be used.
This is specified by adding the allowed registers after the nonterminal in the pattern. For instance, the rules for issuing a multiply and an ALU operation in the same cycle look like this:

    RULE [bi_multrealspec] o:mirmult (r1:reg<r0..r3>, r2:reg<r4..r7>) -> r:reg;
      TEMPLATE mulspec;
      ..

    RULE [bi_plusspec] o:mirplus (r1:reg<r8..r11>, r2:reg<r12..r15>) -> r:reg;
      TEMPLATE aluspec;
      ..

The templates for the two rules above are declared as occupying the multiplier resource and the ALU resource, respectively. Thus the two rules can be issued in the same instruction cycle. An ordinary mirplus rule has the alu template, which actually allocates all three functional units, since in general only one compute operation can be performed in a single cycle.

Since register allocation is performed prior to instruction scheduling, this implementation can sometimes produce slower code than an implementation without the register constraints. Consider the case where the rules above are chosen, but the register allocator needs to perform a spill in order to fulfill the constraints. This approach then produces two extra instructions, one for spilling and one for restoring the register. The best way to handle this problem would be to integrate the register allocator with the instruction scheduler. However, this is not possible in BEG without rewriting all the generated engines by hand. Another solution could be to run the backend twice for each procedure. The first run would use register constraints on the rules, and the second run would omit them. The backend could then be instructed to select the better result.

4.2 Implementing Post Modify Addressing Mode

The SHARC has, like several other DSPs, a specific addressing mode for updating address pointers after accessing memory. This is very efficient when sequentially accessing the values in an array, as is typically done in, for instance, a FIR filter. Some problems occur when trying to implement this in BEG. First, all expressions accessing arrays with indexes must be transformed into pointer expressions, so that the pointer can be post-incremented. Fortunately, there is an engine in the CoSy system that does this. Another problem is that the post modify instruction actually originates from two statements in the CCMIR, the pointer increment statement and the memory access statement.
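The effect of the index-to-pointer transformation can be illustrated in standard C. This is a sketch only: the function names fir_indexed and fir_pointer are hypothetical, plain int samples stand in for the DSP-C fixed type, and the code shows the source-level idea rather than what the CoSy engine actually emits. In the second version each operand is fetched by a memory access paired with a pointer increment, which is the pair of operations that a post modify instruction performs together.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of products written with array indexing, as in a typical
   FIR filter inner loop. */
long fir_indexed(const int *signal, const int *coeff, size_t n)
{
    long acc = 0;
    assert(signal != NULL && coeff != NULL);
    for (size_t i = 0; i < n; i++)
        acc += (long)signal[i] * coeff[i];
    return acc;
}

/* The same computation after rewriting the indexed accesses as
   pointer accesses. Each *p++ is one memory read followed by one
   pointer update. */
long fir_pointer(const int *signal, const int *coeff, size_t n)
{
    const int *s = signal;
    const int *c = coeff;
    long acc = 0;
    assert(signal != NULL && coeff != NULL);
    while (n-- > 0)
        acc += (long)*s++ * *c++;
    return acc;
}
```

Both functions return the same value for the same inputs; only the form of the memory accesses differs.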
BEG cannot reduce two different statements into one nonterminal, so this special case has to be handled separately. The solution was to handle them as two separate operations and, if the scheduler places them in the same instruction, to rewrite them into a single operation.

5 Results

The backend was evaluated on a number of programs. Figure 2 gives some test results.

    C-file    # instructions       ILP (%)
              a     b     c        a     b     c
    8q.c      238   164   223      0     3     3
    fir.c     29    14    19       0     26    12
    mov.c     165   132   176      0     7     5
    mat.c     126   83    105      0     7     2
    vss.c     42    25    32       0     12    22

Figure 2: Test results for the compiler, compared with g21k from Analog Devices. a is g21k without optimization, b is g21k with optimization, and c is our compiler with optimization. ILP means the percentage of instructions issued in parallel.

The file 8q.c is the eight queens problem. It contains some nested loops and recursive function calls. The file fir.c is a simple FIR filter. The last three files contain matrix and vector manipulations.

The tests presented here are only small examples showing that the compiler performs almost as well as g21k, the commercial compiler from Analog Devices. Comparing the assembler files from the two compilers shows that the major difference is that g21k implements software pipelining for a set of standard loops, such as a FIR filter. This optimization is not yet available in CoSy. Runtime tests are presented in [6].

6 Conclusions

A drawback of BEG is that the scheduler only schedules per basic block. This limits the scheduler's options for packing instructions. To get a better schedule, an algorithm working on larger code segments has to be used, for instance trace scheduling[3]. However, these limitations did not affect the implemented code generator for the SHARC processor very much, mostly because the ADSP-2106x in general has only two issue slots, containing a compute operation and a move operation. If certain register constraints are fulfilled, three issue slots can be used. In that case two compute operations, one in the ALU and one in the multiplier, can run in the same instruction as a move operation. Some test results from the backend showed a rather low percentage of parallel instructions, typically between 5 and 30 percent.

A conclusion drawn from this work is that a code generator for a processor can be implemented in about eight to ten man-months, resulting in a compiler that produces almost as good code as a commercial compiler. Of course, a better backend, supporting more optimizations, can be produced if the development time is increased somewhat. Note also that the work included learning the CoSy system. This is a substantial part of the effort, since CoSy is a large system that takes a while to fully understand and master. Approximately half of the working time was dedicated to learning the system. This learning process was integrated with development, with the effect that some design decisions could, in hindsight, probably have been better.
To summarize, one could say that developing a compiler using the CoSy system is far more resource efficient and less error prone than using conventional methods. The fact that many optimizers and DSP extensions already exist in the CoSy system makes the development time even shorter.

References

[1] Analog Devices, Inc. ADSP-2106x SHARC User's Manual. Analog Devices, Inc., first edition, 1995.

[2] Niclas Andersson and Peter Fritzson. Overview and industrial application of code generator generators. Journal of Systems and Software, 1995.

[3] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, 30(7):478-490, 1981.

[4] H. Emmelmann, F. W. Schröer, and R. Landwehr. BEG - a generator for efficient back ends. ACM SIGPLAN Notices, 24(7):227-237, 1989.

[5] Martin Alt, Uwe Assmann, and Hans van Someren. Cosy compiler phase embedding with the CoSy compiler model. In Peter A. Fritzson, editor, Compiler Construction, 1994.

[6] Peter Aronsson and Levon Saldamli. Code generator for SHARC ADSP-2106x. Master's thesis, Dept. of Computer and Information Science, Linköping University, 1999.