High Level Software Cost Estimation


Per Bjuréus

Abstract

This report describes the processor characterization method and the software cost estimation technique used in the Polis codesign tool environment. The processor characterization method has been exercised by applying it to the ARM processor family. In particular, two processors, the ARM7TDMI and the ARM920T, have been examined. An improved method is proposed, which is supported and partially automated by two utility tools. The improved method is based on an iterative two-pass technique. The first pass extracts processor characteristics from generic software templates. The second pass improves the parameters from the first pass using a validation method. The results obtained during the exercise are presented and discussed. In particular, the effect of instruction and data cache memory is addressed. The estimation technique used today is well suited for processors without cache, but processors with cache call for new techniques.

Introduction

High-level software cost estimation is an attractive feature in a system design flow. It allows the designer to estimate software code size and performance in an early design phase. The approach to high-level software cost estimation that is available in the Polis codesign environment developed at UC Berkeley [2] is based on work done by K. Suzuki et al. [1]. Under the assumption that the software program can be represented as a set of directed acyclic graphs (DAGs), called S-graphs, the estimation is performed using macro modeling. A number of macros, representing the different types of nodes that constitute the S-graph, are collected in a set of template files. The template files are profiled for the processor that will be used, and size and time parameters for each individual macro are extracted. The profiling is performed only once, using an Instruction Set Simulator (ISS) or debugger for the processor in question.

The macro parameters are collected in a parameter file, which is used to estimate the cost of any software program that runs on the processor. In this way there is no need for a designer to install and learn any simulators or debuggers to evaluate the software cost on different microprocessors. This amounts to a fast and convenient way to make design trade-off decisions between hardware, software, and functionality of an embedded system.

The main objective of this project was to exercise and possibly improve the processor characterization methodology in Polis. The ARM processor family was selected for the experiments for several reasons. The ARM processors are widely used and have until now not been available in the Polis environment. The ARM cores are a family of processors with different characteristics, which allows a wide range of processor configurations to be analyzed without changing the experimental framework radically. The ARM processor comes with a Software Development Kit (SDK) that contains a set of tools in an open environment.

A second objective was to study the effects that instruction and data caches have on software cost estimation and to possibly suggest solutions to the expected problems. Some previous work in this direction has been performed by Lajolo et al. [3].

Processor Characterization

Processor characterization in Polis is based on the assumption that the software program is decomposed into communicating Codesign Finite State Machines (CFSMs) that are executed upon request by a scheduler or operating system. The communication between CFSMs is asynchronous, and a signal enables a CFSM when it is received. The scheduler executes enabled CFSMs according to a scheduling policy. Each CFSM can be represented by a polar directed acyclic graph (DAG) called an S-graph. When the CFSM is executed, only one execution path is traversed in the S-graph from the Begin node to the End node. The S-graph is composed of a fixed set of node types. Each node type in an S-graph is represented by a macro, which is an atomic operation. A macro will eventually be executed as a sequence of instructions on a microprocessor.

The idea is that if the code size and execution time of each macro can be estimated, so can those of the S-graph, and hence of the CFSM. If the CFSMs are annotated with the estimated execution time, simulation of the system will yield a performance estimate of the whole system; this is referred to as performance simulation.

The goal of processor characterization is thus to estimate the code size and execution time of the macros that constitute the S-graphs. All macro cost estimates are collected in a parameter file, which is processor specific. The parameter file is read by Polis, which annotates the software before performance simulation can be carried out. Estimating the macros involves compiling and analyzing the code for the intended processor. The current methodology for parameter file generation is outlined in Figure 1.

    Template Files -> Compiler -> Assembler Files -> Debugger -> Parameter File

Figure 1. Processor Characterization Flow

The macros are collected in template files that are written in ANSI C. A processor-specific compiler compiles the template files, and assembler files are generated. The assembler files are analyzed manually or with a debugger or Instruction Set Simulator (ISS), and the code size and execution time of each macro are extracted. The code size and execution time estimates are manually collected in the processor-specific parameter file. This approach requires good knowledge of the compiler and debugger. It also involves a lot of tedious work analyzing the debugger output and converting it to a parameter file.
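The estimation idea described above (macro costs add up along an execution path through the S-graph) can be illustrated with a minimal C sketch. It is not the Polis implementation: the SNode and SCost types, the limit of two successors per node, and the fixed bound of 64 nodes are assumptions made for this example only. Nodes are assumed to be stored in topological order, so every successor index is larger than the index of the node itself.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical S-graph node: one macro with its cost and up to two
   successors. A successor index of -1 means "no edge". Node 0 is Begin,
   the last node is End, and successors always have a larger index. */
typedef struct {
    int cycles;    /* execution time of the node's macro */
    int bytes;     /* code size of the node's macro */
    int succ[2];   /* successor indices, or -1 */
} SNode;

typedef struct {
    int min_cycles;   /* cost of the shortest Begin-to-End path */
    int max_cycles;   /* cost of the longest Begin-to-End path */
    int total_bytes;  /* sum of the code size of all nodes */
} SCost;

SCost sgraph_cost(const SNode *g, size_t n)
{
    int best[64], worst[64];   /* cheapest / most expensive path to End */
    SCost c = {0, 0, 0};

    if (n == 0 || n > 64)
        return c;

    /* Walk backwards so that successors are already evaluated. */
    for (size_t k = n; k-- > 0; ) {
        int lo = INT_MAX, hi = -1;
        for (int e = 0; e < 2; e++) {
            int s = g[k].succ[e];
            if (s < 0)
                continue;
            if (best[s] < lo)  lo = best[s];
            if (worst[s] > hi) hi = worst[s];
        }
        if (hi < 0) {          /* End node: nothing follows */
            lo = 0;
            hi = 0;
        }
        best[k]  = g[k].cycles + lo;
        worst[k] = g[k].cycles + hi;
        c.total_bytes += g[k].bytes;
    }
    c.min_cycles = best[0];
    c.max_cycles = worst[0];
    return c;
}
```

Execution time depends on the path taken, so the sketch reports a minimum and a maximum, whereas code size is simply the sum over all nodes because every node is present in the compiled image.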

    ;;;4 void tmp_avv(int proc, int inst)
    ;;;5 {
    tmp_avv
    e1a02001 MOV a3,a2
    ;;;6 v_st1_enc = v_st2_enc;
    e59f1010 LDR a2,[pc, #L00001c-.-8]
    e LDR a2,[a2,#0]
    00000c e59f300c LDR a4,[pc, #L ]
    e STR a2,[a4,#0]
    ;;;7
    ;;;8 return;
    e1a0f00e MOV pc,lr
    ;;;9 }

Figure 2. Compiled Macro with Interleaved Source Code

Figure 2 shows an excerpt from the AVV macro that has been compiled into assembler code interleaved with the original macro C code.

Since the objective of this project was to evaluate several different processors in the ARM family, the above methodology was improved and automated for efficiency. The new methodology is outlined in Figure 3.

[Flow diagram with the elements: Template Files, Compiler, Assembler Files, Makefile, Annotation, Debugger, Log File, log2param, Parameter File, Archar, Debugger Script]

Figure 3. Improved Processor Characterization Flow

First, the template files are annotated with macro Entry and Exit points. A program called Archar operates on the annotated template files and generates a debugger script. The log file generated by the debugger is converted by another program, log2param, into a raw parameter file. Compilation, debugger execution, and log file conversion are performed by a Makefile, which keeps track of changes and dependencies between the files.

Archar reads the template files and generates a debugger script that is used by the ARM symbolic debugger armsd for analysis. The annotated AVV macro is shown in Figure 4.

    void tmp_avv(int proc, int inst)
    {
      /*Enter AVV*/
      v_st1_enc = v_st2_enc;
      /*Exit AVV*/
      return;
    }

Figure 4. Annotated Macro Function

The Archar program generates a debugger script, which has three parts: a prefix, macro entry and exit commands, and a postfix. The prefix is used to configure the debugger. The entry and exit commands insert breakpoints at macro Entry and Exit. The breakpoints are programmed with a command that is executed when the breakpoint is reached. The breakpoint command outputs the macro name, the current program counter address, and the current cycle count. A portion of the log file produced when the debugger runs the script generated by Archar is shown in Figure 5.

    enter:avv Total
    exit:avv Total

Figure 5. Excerpt From Debugger Log File

To convert the debugger log file into a parameter file, another program, log2param, was written. The program reads the log file, records the macro Entry and Exit points, and performs the necessary operations to output a parameter file. log2param accepts nested macro Entry and Exit points, allows multiple calls to the same macro, and supports several cycle count variables.

Another feature of both Archar and log2param is that they distinguish between macros and software library functions. A macro name followed by a colon and the func keyword indicates a function. The func keyword must be followed by the function output bit width in parentheses. An example of an annotated software library function is shown in Figure 6.

    void tmp_timesl(int proc, int inst)
    {
      /*Enter _TIMES:func(32)*/
      v_2_enc = _TIMES(v_sL3_enc, v_const2l);
      /*Exit _TIMES:func(32)*/
      return;
    }

Figure 6. Annotated Software Library Function

The corresponding log file generated by the debugger is shown in Figure 7.
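The tag format seen in Figure 5, together with the func(width) convention for software library functions, could be parsed along the following lines. This is only a sketch and not the source of log2param: the LogTag type and the parse_log_tag function are hypothetical names, and only the leading tag of a log line is handled, since the layout of the numeric fields is not reproduced in this report.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one Entry or Exit tag in the debugger log. */
typedef struct {
    int  is_entry;    /* 1 for "enter", 0 for "exit" */
    char name[32];    /* macro or function name */
    int  is_func;     /* 1 when a ":func(width)" suffix is present */
    int  out_width;   /* function output bit width, 0 for plain macros */
} LogTag;

/* Parses tags such as "enter:avv" or "exit:_times:func(32)".
   Returns 1 on success and 0 if the tag is malformed. */
int parse_log_tag(const char *tag, LogTag *out)
{
    const char *rest;
    const char *colon;
    size_t len;

    if (strncmp(tag, "enter:", 6) == 0) {
        out->is_entry = 1;
        rest = tag + 6;
    } else if (strncmp(tag, "exit:", 5) == 0) {
        out->is_entry = 0;
        rest = tag + 5;
    } else {
        return 0;
    }

    /* The name runs up to the next colon or to the end of the tag. */
    colon = strchr(rest, ':');
    len = colon ? (size_t)(colon - rest) : strlen(rest);
    if (len == 0 || len >= sizeof out->name)
        return 0;
    memcpy(out->name, rest, len);
    out->name[len] = '\0';

    out->is_func = 0;
    out->out_width = 0;
    if (colon) {
        int consumed = 0;
        /* %n is only stored if the closing parenthesis was matched. */
        if (sscanf(colon, ":func(%d)%n", &out->out_width, &consumed) != 1
            || consumed == 0 || colon[consumed] != '\0'
            || out->out_width <= 0)
            return 0;
        out->is_func = 1;
    }
    return 1;
}
```

Keeping the macro/function distinction in the tag itself means the log converter needs no separate table of library functions.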

    enter:_times:func(32) Total
    exit:_times:func(32) Total

Figure 7. Software Library Function Log

If multiple calls to the macros are performed during debugger analysis, each call will generate an Entry and Exit point in the log file. The parameter file generated by log2param will then contain the average execution time for macros, and the maximum and minimum execution times for software library functions. The parameter file generated by log2param from the AVV macro and the _TIMES software library function is shown in Figure 8.

    .time AVV 11
    .size AVV 16
    .dp func=times max_cycle=19 min_cycle=19 size=28 out_width=32

Figure 8. Parameter File Example

The output from log2param is a raw parameter file, which means that it still needs some manual editing before it can be read into Polis. All software library functions involve the assignment of the function output to a variable. Thus the assignment, which belongs to a separate macro, is counted in the parameters for software library functions and needs to be subtracted. For example, consider the software library function call in Figure 6. The size and execution time of the assignment (v_2_enc = _TIMES) must be subtracted from the size and execution time parameters for _TIMES respectively.

Parameter values for ENC, TIENCT, TIENCF, and TIENC are currently not implemented in the template files, and it is unclear what their purpose is. All those parameters are set to zero. The pointer size parameter PTR is entered manually.

The parameter file must contain the name of the processor, the units that are used, and the bit width of an integer variable; those lines are simply added to the beginning of the parameter file according to Figure 9.

    .name ARM7TDMI
    .unit_time cycle
    .unit_size byte
    .int_width 32

Figure 9. Parameter File Additions for the ARM7TDMI

When the parameter file has been modified, it can be used by the Polis tool for size and execution time estimation.
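The aggregation over multiple calls described above (an average for macros, a minimum and maximum for software library functions) amounts to simple statistics over the cycle counts recorded at each Entry and Exit pair. The following C sketch shows one way to compute them. The Call and CallStats types and the summarise_calls function are hypothetical, and the sketch assumes that Entry and Exit points have already been paired.

```c
#include <stddef.h>

/* Hypothetical record of one call: cycle counter at Entry and at Exit. */
typedef struct {
    long enter_cycles;
    long exit_cycles;
} Call;

typedef struct {
    long   min_cycles;   /* reported for software library functions */
    long   max_cycles;   /* reported for software library functions */
    double avg_cycles;   /* reported for macros */
} CallStats;

/* Summarises n recorded calls of the same macro or function.
   Returns 0 when there is nothing to summarise, 1 otherwise. */
int summarise_calls(const Call *calls, size_t n, CallStats *out)
{
    long sum = 0;

    if (n == 0)
        return 0;

    out->min_cycles = calls[0].exit_cycles - calls[0].enter_cycles;
    out->max_cycles = out->min_cycles;
    for (size_t i = 0; i < n; i++) {
        long d = calls[i].exit_cycles - calls[i].enter_cycles;
        if (d < out->min_cycles) out->min_cycles = d;
        if (d > out->max_cycles) out->max_cycles = d;
        sum += d;
    }
    out->avg_cycles = (double)sum / (double)n;
    return 1;
}
```

For a macro the caller would write avg_cycles to the .time parameter, and for a library function it would write min_cycles and max_cycles to the min_cycle and max_cycle fields of the .dp line.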

Estimation Validation

For validation purposes, an ATM switch example was selected, which will be referred to as the ATM. The ATM consists of several modules, each representing one CFSM in the system. This section describes the design flow used to synthesize the software and to profile it using the ARMulator. The validation flow is outlined in Figure 10.

[Flow diagram with the elements: Esterel Program, Polis, Parameter File, Source Files, Cost Estimation, Compiler, Size?, DoDelay Image, Simulation, Time?, Debug Image, Execution]

Figure 10. Validation Flow

The parameter file from the processor characterization is used together with the ATM Esterel program. The Esterel program is compiled into a SHIFT file (Software Hardware Intermediate FormaT) using strl2shift. Polis converts the SHIFT file into an internal S-graph representation. The parameter file is used to assign weights to the edges of the S-graph, which enables Polis to generate a software source file and a cost estimation file. The cost estimation file contains the expected maximum and minimum execution time of each CFSM in the design along with a code size estimate. The Polis execution script that was used to generate the source files and the cost estimation file is shown in Figure 11.

    read_shift atm_v.shift
    propagate_const
    set_impl -s
    partition
    build_sg
    set arch ARM7TDMI
    read_cost_param
    sg_to_c -D -d software
    gen_os -D os -d software
    set polisout software/sgraph.txt
    print_sg
    set polisout software/cost.txt
    print_cost -sn
    quit

Figure 11. Polis Execution Script

It is beyond the scope of this report to go into detail about all the steps that Polis performs. However, it is worth pointing out that in order to run the ATM on a workstation for performance simulation, all modules (CFSMs) were implemented as software.

When the software source files have been generated by Polis, the application is built using the ARM project manager. All source files, including the generated OS file (os.c) and a file containing memory library functions (mem_lib.c), are compiled and linked into an ARM executable image. The UNIX, ESTEREL, and BENCH variables are set during compilation to build a stand-alone executable image called Debug. The estimated code size for each module, found in the cost file (cost.txt), can be compared to the number reported by the map file generated by the ARM link tool.

To measure the execution time, a debugger script was written that inserts breakpoints at module Entry and Exit points. The breakpoints are programmed to execute commands very similar to the Entry and Exit commands used for macro estimation. An input pattern (i.e. a scenario) is applied to the application, and the cycle times are written to a log file at the module Entry and Exit points. The log file generated by the debugger has the same format as the log file generated during processor characterization and can be read by log2param. The log2param program has a simple mode, which does not generate a formatted parameter file but rather outputs an unformatted one. An excerpt from such a validation parameter file is shown in Figure 12.

    collision_detector 604 bytes
    Instructions S_Cycles I_Cycles Total
    arbiter_sc 308 bytes
    Instructions S_Cycles I_Cycles Total

Figure 12. Validation Parameter File

The numbers from the validation parameter file are imported into Excel for analysis.

The maximum and minimum execution times found in the cost file generated by Polis refer to the longest and shortest paths in the S-graph respectively. However, it is hard to run an exhaustive simulation of the application to make sure that exactly those execution paths are traversed. Therefore, for validation purposes, the estimated execution times of the execution paths actually traversed must be recorded instead.
This is accomplished by building an executable image with the DO_DELAY pre-processing variable set during compilation. The OS file has to be slightly modified to allow the debugger to run without user interaction through a debugger script. The image generated is called DoDelay. It is important to remember that the DoDelay image cannot be used to measure the actual code size or execution time. When the DoDelay image is executed, it reads the input pattern from a file and reports the estimated execution times upon completion. The numbers reported are imported into Excel for comparison with the measured values.

Results

Two ARM processor configurations were characterized and validated: the ARM7TDMI core and the ARM920T processor. Two memory configurations were analyzed for both processors, one fast and one slow. The fast memory was configured not to impose any wait states on program execution. The slow memory configuration, on the other hand, had a sequential/non-sequential read and write latency of 120/90 ns, which required wait states to be inserted during program execution. However, due to the current uncertainty in the analysis of the slow memory configuration, those results are omitted from this report.

The ARM7TDMI core was not equipped with cache memory, whereas the ARM920T processor was equipped with a 16 KB data cache and a 16 KB instruction cache. Both processors are 32-bit processors. The 16-bit Thumb instruction set has not been examined.

Code Size

The estimated and measured code sizes of the individual ATM modules were compared for the ARM7TDMI processor. Originally, the maximum absolute estimation error was 70%. The large difference was unexpected, and the error source was investigated immediately by comparing the estimated size reported for each S-graph node with the actual size in the assembly program. Three major error sources were identified rather quickly:

1. The Polis tool did not estimate variables that were local to the modules. Each simple variable occupied 4 bytes of data memory, and an array occupied 4 bytes per element plus an additional 8 bytes for reference variables (pointers). The initialization of a simple variable required 4 instructions (16 bytes), and the initialization of an array required 6 instructions (24 bytes).

2. The net_quid/memory2, net_state/memory, and net_m2/sub_sort modules called the create, getdata, and putdata memory library functions in mem_lib.c. Each such function call required additional instructions compared to the generic SWL macro. create required 1 more instruction, getdata required 4 more instructions, and putdata required 7 more instructions.

3. The TIDT macro and the AVC macro required one more instruction each.
Those deficiencies were corrected manually in the Excel worksheet, and an absolute maximum error of 12% was achieved. The results are shown in Figure 13.

[Bar chart: original and corrected size estimation error, in percent, for each ATM module and for the total]

Figure 13. ATM Size Estimation Error for ARM7TDMI
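The manual size corrections listed above are plain arithmetic and can be sketched in C as follows. The ModuleProfile type and both functions are hypothetical and do not reproduce the Excel worksheet. The sketch covers only the code added by variable initialization and by the memory library calls; data memory and the extra instruction for the TIDT and AVC macros are left out. The sign convention of the error (negative means under-estimation) is an assumption inferred from the tables in this report.

```c
/* Signed relative error in percent. A negative value means that the
   estimate is smaller than the measurement (assumed convention). */
double error_percent(double estimated, double measured)
{
    return 100.0 * (estimated - measured) / measured;
}

/* Hypothetical per-module counts needed for the corrections. */
typedef struct {
    int simple_vars;    /* local simple variables that are initialised */
    int arrays;         /* local arrays that are initialised */
    int create_calls;   /* calls to create in mem_lib.c */
    int getdata_calls;  /* calls to getdata in mem_lib.c */
    int putdata_calls;  /* calls to putdata in mem_lib.c */
} ModuleProfile;

/* Adds the code size that the original estimate did not account for. */
int corrected_size(int estimated_bytes, const ModuleProfile *m)
{
    const int insn = 4;   /* bytes per 32-bit ARM instruction */

    return estimated_bytes
         + m->simple_vars * 4 * insn          /* 4 instructions each */
         + m->arrays * 6 * insn               /* 6 instructions each */
         + (m->create_calls * 1               /* extra instructions   */
            + m->getdata_calls * 4            /* compared with the    */
            + m->putdata_calls * 7) * insn;   /* generic SWL macro    */
}
```

For example, a module estimated at 300 bytes that initializes two simple variables and one array and calls each library function once would be corrected to 300 + 32 + 24 + 48 = 404 bytes.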

The corrections made were fed back into the parameter file. The software library functions in the mem_lib.c file were added to the parameter file as .dp parameters, which override the generic SWL macro. The .size parameters for the TIDT and AVC macros were corrected by adding 4 bytes to each of them.

The size estimation for the ARM920T was performed using the same procedure. The same observations were made, and no major differences were found. The result of the code size estimation for the ARM920T is shown in Figure 14.

[Bar chart: original and corrected size estimation error, in percent, for each ATM module and for the total]

Figure 14. ATM Size Estimation Error for ARM920T

The estimation error for the ARM920T is displaced towards over-estimation compared to the ARM7TDMI. No action was taken to investigate this phenomenon further.

Execution Time

The execution time was measured by running the Debug executable image using a simple input scenario. The estimated execution time was extracted by running the DoDelay executable image. The original difference between estimation and measurement was 98%, which was clearly unacceptable. An investigation into the cause of the large error was carried out.

The size estimation had shown that Polis did not estimate the initialization of the internal variables of the modules, so an additional 12 cycles per variable were added to the Min, Max, and Avg execution times for initialization.

The memory access functions create, putdata, and getdata, which are called by some of the modules, required separate modeling, because the average software library function execution time could not be applied to those functions.
This was accomplished by adding the maximum and minimum execution times to the corresponding .dp parameters in the parameter file.

The AEMIT macro caused major problems, because its execution time varied widely. Execution times between 6 and 139 cycles were recorded. The two modules that used AEMIT most were net_m2/sub_sort and supervisor3. Those modules are also the ones that exhibit the largest deviations between estimated and measured execution time. To tackle the problem, an average execution time for the AEMIT macro was used. An average execution time is probably application dependent and might also depend on the input pattern. Therefore, this approach can easily cause large errors if the number is used for other applications or input patterns.

Corrections for execution time errors cannot be applied directly in the Excel worksheet as is done for size corrections, because they depend on the execution path. A second pass is needed, in which the appropriate time variables in the parameter file are changed to examine the improved model. The results are shown in Figure 15.

[Bar chart, titled "2nd Pass ATM Execution Time Estimation Error for the ARM7TDMI": Max, Min, and Avg execution time estimation error, in percent, for each ATM module and for the total]

Figure 15. ATM Execution Time Estimation Error for ARM7TDMI

The size and execution time errors for the ARM7TDMI core, before and after correction, are listed in Table 1.

Table 1. Estimation Errors for the ARM7TDMI

                          Original Error            Corrected Error
Module                Size   Max   Min   Avg    Size   Max   Min   Avg
arbiter_sc            -18%  -12%  -19%  -17%      7%    1%   -4%   -3%
arbiter_sorter        -23%  -19%  -24%  -23%      7%   -1%   -4%   -3%
collision_detector    -51%  -17%  -55%  -41%      2%   11%   -7%   -1%
counter               -35%   -5%  -27%   -7%     -4%   11%    2%   10%
extract_cell2         -26%  -36%  -38%  -37%      2%   -1%    0%   -1%
first_cell            -31%  -48%  -49%  -49%      8%    7%    7%    7%
lqm_arbiter3          -12%  -35%  -37%  -36%      7%    5%    6%    5%
msd_technique         -24%  -58%  -61%  -59%     -2%   -5%   -6%   -6%
net_m2/sub_sort       -35%  -18%  -17%  -17%     -7%   12%   14%   13%
net_quid/memory2      -70%  -98%  -45%  -93%    -12%   -4%  -13%   -3%
net_state/memory      -64%  -97%  -47%  -94%     -7%   -1%  -12%  -12%
sorter2               -45%  -27%  -29%  -28%     -4%   -3%   -3%   -3%
space_controller      -33%  -47%  -48%  -47%      3%    3%    4%    3%
supervisor3           -42%   -7%  -63%  -48%      4%   13%   -4%   -1%
Total                  33%  -49%   51%   57%      0%   -1%    0%   -3%
Abs Max Error         -70%  -98%  -63%  -94%     12%   13%   14%   13%
Average Error          36%   38%   40%   43%      0%    3%   -1%    0%

The same procedure was applied to the ARM920T processor. After a first pass with very large estimation errors, corrections for software library functions were made.
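The summary rows of Table 1 can be reproduced from the per-module values. The C sketch below, in which the ErrorSummary type and the summarise_errors function are hypothetical, computes the largest absolute error and the mean of the signed errors for one column. Applied to the corrected Size column of Table 1 it gives 12 and roughly 0.3, which agrees with the 12% and 0% printed in the table. That the Average Error row is the mean of the signed per-module errors is an assumption that happens to fit these numbers.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    int    abs_max;   /* largest absolute error in percent */
    double mean;      /* mean of the signed errors in percent */
} ErrorSummary;

/* Summarises one column of per-module estimation errors. */
ErrorSummary summarise_errors(const int *err_percent, size_t n)
{
    ErrorSummary s = {0, 0.0};
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        int a = abs(err_percent[i]);
        if (a > s.abs_max)
            s.abs_max = a;
        sum += err_percent[i];
    }
    if (n > 0)
        s.mean = (double)sum / (double)n;
    return s;
}
```

Because positive and negative errors cancel in the mean, a small Average Error does not by itself indicate an accurate estimate; the Abs Max Error row is the more conservative figure.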
The result of a second-pass estimation is shown in Figure 16.

[Bar chart, titled "2nd Pass ATM Execution Time Estimation Error for the ARM920T": Max, Min, and Avg execution time estimation error, in percent, for each ATM module and for the total]

Figure 16. ATM Execution Time Estimation Error for ARM920T

After a second pass, the estimation errors for the ARM920T processor were still much larger than those observed with the ARM7TDMI core. The minimum execution times were generally over-estimated, and the maximum execution times were generally under-estimated. The size and execution time estimation errors before and after correction are summarized in Table 2.

Table 2. Estimation Errors for the ARM920T

                          Original Error            Corrected Error
Module                Size   Max   Min   Avg    Size   Max   Min   Avg
arbiter_sc             -5%  -33%   -3%  -21%      7%  -23%   15%   -7%
arbiter_sorter        -11%  -36%    1%  -21%      5%  -21%   29%    0%
collision_detector    -41%  -35%  -39%  -48%     -8%   -7%   29%   -7%
counter               -27%  -11%  -10%   22%     -7%    4%   25%   44%
extract_cell2         -18%  -47%  -37%  -41%     -4%  -17%    2%   -6%
first_cell            -28%  -57%  -27%  -43%     -2%  -12%   54%   19%
lqm_arbiter3           -6%  -48%  -23%  -38%      4%  -14%   30%    3%
msd_technique         -19%  -77%  -58%  -67%     -9%  -47%    2%  -23%
net_m2/sub_sort       -33%  -43%   31%    0%      1%  -23%   79%   36%
net_quid/memory2      -64%  -97%  -21%  -90%     -7%   -2%   24%    3%
net_state/memory      -59%  -96%  -24%  -91%     -1%    0%   26%   -8%
sorter2               -33%  -39%  -18%  -29%     -7%  -19%   12%   -3%
space_controller      -27%  -58%  -44%  -50%     -5%  -17%   12%    0%
supervisor3           -38%  -16%  -50%  -45%     -7%    3%   31%    6%
Total                 -28%  -79%  -24%  -62%     -3%   -6%   30%    2%
Abs Max Error          64%   97%   58%   91%      9%   47%   79%   44%
Average Error         -29%  -50%  -23%  -40%     -3%  -14%   26%    4%

Conclusions

The processor characterization flow developed in this project was very useful for quick evaluation of different processor and memory configurations built on the ARM processor. The ARM Software Development Kit is sophisticated but lacks some documentation.

A valuable addition to the log2param utility would be an expression builder that would allow a fully automatic conversion of the debugger log file to a parameter file. The Archar tool currently runs on Windows NT, and several extensions are possible. The original goal was to let Archar house the whole processor characterization flow, but the solution with a complementary Makefile was chosen due to lack of time.

Using template files for macro profiling is generally a good idea, but it is intrinsically hard to develop templates that capture all the effects of software compilation. Further research is needed on the topic, and eventually a better set of template files should be developed. Meanwhile, the two-pass approach used in this project can be successfully applied. Starting with a set of template files that generate reasonable numbers, a second pass in which closer analysis of a real application is taken into account will hopefully provide the required accuracy. The proposed methodology thus involves iteration.

Going into detail, the EMIT macro alone constituted one of the major problems encountered during validation. A deeper study of the execution time of the EMIT macro is suggested. From the superficial analysis made of the execution of EMIT, a subdivision of the EMIT macro seems to be needed. At least three different types of EMIT were observed in the execution of the ATM example.

In this project, one processor with cache and one without were deliberately chosen to investigate the effect that cache memory has on execution time. Studying the parameter file generated for the ARM920T, it immediately becomes obvious that the cache memory will affect the estimation. The minimum and maximum execution times for software library functions (defined by .dp parameters) differ by up to a factor of four. The large estimation error after the second pass on the ARM920T processor can also be explained by the cache behavior. A cache miss will generate an execution time that is larger than the average execution time captured by the parameter. Thus the actual maximum execution time will be larger than the estimated maximum execution time, in effect under-estimating the maximum execution time. Conversely, a cache hit will generate an execution time smaller than the average, in effect over-estimating the minimum execution time. The large estimation errors (up to 79%) motivate further investigation of cache behavior and cache estimation techniques.

References

[1] K. Suzuki, A. Sangiovanni-Vincentelli, Efficient Software Performance Estimation Methods for Hardware-Software Codesign, Proceedings of the Design Automation Conference (DAC).

[2] F. Balarin, M. Chiodo, A. Jurecska, H. Hsieh, A. L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, B. Tabbara, Hardware-Software Co-Design of Embedded Systems: The Polis Approach, Kluwer Academic Press, June 1997.

[3] M. Lajolo, L. Lavagno, A. Sangiovanni-Vincentelli, Fast Instruction Cache Simulation Strategies in a Hardware/Software Co-Design Environment, Proceedings of the ASP-DAC 99 Asian and South Pacific Design Automation Conference.
