Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path


Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis
VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, Patras 26500, Greece
mgalanis@ee.upatras.gr

Abstract

The execution time improvements achieved in a generic microprocessor system by employing a high-performance data-path are presented. The data-path acts as a coprocessor that accelerates computationally intensive kernel regions, thereby increasing the overall performance. The data-path has been previously introduced and is composed of Flexible Computational Components (FCCs) that can realize any two-level template of primitive operations. For evaluating the effectiveness of our coprocessor approach, several real-world DSP applications are mapped to the system. A study of the performance improvements relative to the microprocessor architecture and to the computational resources of the data-path is performed. Significant overall application speedups are reported that range from 1.75 to 3.95, with an average value of 2.72, while the overhead in circuit area is small.

I. INTRODUCTION

Embedded systems have to meet the increasing requirements for high performance and reduced energy consumption of contemporary applications, like baseband processing of communication protocols and digital imaging. The majority of Digital Signal Processing (DSP) and multimedia applications usually spend most of their time executing a small number of time-critical regular code segments, called kernels. The kernels are commonly located in loop structures and exhibit high amounts of operation parallelism. Custom coprocessor hardware is typically utilized to realize critical kernels that would execute considerably more slowly on a microprocessor.
Furthermore, for reducing the time-to-market of embedded systems, automated synthesis flows are required for constructing application-specific loop coprocessors from high-level specifications. Research activities in High-Level Synthesis (HLS) [1] and in Application Specific Instruction Processors (ASIPs) [2], [3] have proven that the use of complex computational structures, called templates or clusters, instead of only primitive ones (like a single ALU) in custom data-paths improves performance. A template may be a specialized hardware unit or a group of chained units. Chaining is the removal of the intermediate registers between the primitive units, improving the total delay of the combined units. In previous work [4], we introduced a high-performance data-path that is composed of Flexible Computational Components (FCCs). The FCC is a combinational circuit consisting of a 2x2 array of Processing Elements (PEs). Each PE contains one ALU and one multiplication unit, of which one is activated at each control step of the schedule. Due to the flexible connections inside the FCC, any two-level complex template operation can be easily derived. A smaller number of FCCs covers a Data Flow Graph (DFG) compared with existing template-based data-paths. This allows the introduction of inter-component connectivity, which enables inter-FCC chaining, resulting in performance improvements relative to primitive-resource and template-based data-paths. A flow for synthesizing high-level descriptions to the FCC-based data-path was also introduced. In this paper, we present the integration of the high-performance data-path into a system for improving the application's performance. The available Instruction Level Parallelism (ILP) of computationally intensive kernels is efficiently exploited by the flexible arithmetic units of the data-path, leading to significant kernel acceleration.
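As a rough behavioral illustration of the idea above, the sketch below models an FCC as a 2x2 PE array whose second-level PEs consume the first-level results directly (operation chaining), so any two-level template of primitive operations can be configured. This is an assumption-laden simplification (real second-level PEs can also take register-bank inputs A-D), not the authors' implementation.

```python
# Behavioral sketch (NOT the paper's RTL) of a Flexible Computational
# Component: a 2x2 array of PEs, each applying one primitive ALU/multiply
# operation; second-level PEs chain directly on first-level outputs.

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "shl": lambda a, b: a << b,
}

def fcc_eval(level1, level2, inputs):
    """level1: two (op, in_a, in_b) PE configs fed by primary inputs;
    level2: two (op, src_a, src_b) PE configs whose sources index the
    first-level outputs (chained, no intermediate registers).
    Returns the two second-level results."""
    l1 = [OPS[op](inputs[a], inputs[b]) for op, a, b in level1]
    return [OPS[op](l1[a], l1[b]) for op, a, b in level2]

# A multiply-accumulate template (a*b + c*d, a*b - c*d) in one FCC pass:
out = fcc_eval(
    level1=[("mul", 0, 1), ("mul", 2, 3)],
    level2=[("add", 0, 1), ("sub", 0, 1)],
    inputs=[2, 3, 4, 5],
)
# out == [26, -14]
```

Because the template is just a multiplexer configuration, the same hardware realizes a different two-level operation each control step.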
An instruction-set processor executes sequential irregular code segments and provides software programmability. For realizing a complete application on the generic single-chip system, a design flow is introduced. The system's performance is estimated by simulation. Analytical experiments are performed for assessing the effectiveness of the FCC-based coprocessor and of the design flow. Eight real-life DSP applications are mapped on the six instances of the generic system. Important application speedups are reported, as the design flow accelerates each application close to the ideal speedups. The rest of the paper is organized as follows. Section II presents existing research activities in synthesizing kernel coprocessors. Section III overviews the system's architecture and the data-path. The design flow and the synthesis method are given in section IV. The experiments are given in section V, while the conclusions are drawn in section VI.

II. RELATED WORK

Coprocessors are used for accelerating the computation of time-critical procedures, relieving the system's microprocessor from these application parts. The PICO system [5] synthesizes nonprogrammable accelerators, under a given performance constraint, to be used as coprocessors for functions expressed as loop nests in C. The generated coprocessors consist of a synchronous array of one or more customized processor data-paths. A VLIW processor executes non-critical application code. Results are given for synthesizing over thirty loop nests into hardware. Nevertheless, results from executing complete applications were not provided. Also, PICO targets a specific systolic array architecture that limits the applicability of the synthesis flow. Two different coprocessor architectures are presented in [6] that were evaluated on a JPEG encoder. An academic CPU is used as the host processor. Two new instructions have to be added in the instruction-set for coupling the customized

data-paths to the microprocessor. This limits the applicability of their approach, since the modification of the instruction-set of a microprocessor, like an ARM, requires the modification of the compiler infrastructure, which is a rather complicated task. Coprocessor circuits built by an automated synthesis flow can also be implemented in FPGA logic [7], [8]. However, in this case the performance is limited due to the higher delay of the FPGA logic relative to a custom ASIC implementation of the generated coprocessor. Furthermore, FPGAs consume more power and occupy considerably larger area than ASIC circuits. Area and power consumption are important design parameters in embedded systems. Our work generates coprocessor data-paths for improving kernels' execution under a given constraint on the arithmetic units. The proposed data-path is implemented in ASIC logic and is flexible enough to realize any two-level complex computational structure. Furthermore, no modification of the microprocessor's instruction-set is required to execute an application. A thorough study is performed with eight realistic DSP applications, and results are provided from simulating the execution of the overall applications on the system.

III. SYSTEM ARCHITECTURE

The proposed FCC-based data-path is coupled with a host microprocessor for executing complete applications. An outline of a system-on-chip (SoC) architecture employing the coprocessor is shown in Fig. 1. The microprocessor, typically a RISC one, executes sequential control-dominant software parts, while the FCC-based data-path realizes time-critical kernel code. The shared data RAM stores global data for the execution of the application on the system. The processor and the data-path are connected to the data RAM via a global bus.

Figure 1. Generic diagram of the system architecture.

The data communication model between the FCC-based data-path and the processor uses a shared-memory mechanism.
The shared memory is composed of the shared data RAM and a subset of the registers in the register bank of the coprocessor (section III-A). Scalar variables are exchanged via the shared registers, while global variables and data arrays are allocated in the shared data RAM. Both the microprocessor and the data-path have access to the shared memory. The communication process used by the processor and the coprocessor preserves data coherency by requiring the execution of the processor and the coprocessor to be mutually exclusive. A kernel is replaced with code that enables the data-path using a start signal and performs data communication by transferring live-in scalar variables, produced by the microprocessor, to the shared registers. Then, the data-path executes the kernel. Upon completion, it informs the microprocessor using a done signal, writes to the shared registers the live-out scalar variables needed by the code segments following the kernel, and writes global variables and array data located in the shared RAM. Then, the execution of the application continues on the processor. The mutually exclusive execution makes the programming of the system architecture easier by eliminating complicated analysis and synchronization procedures.

A. Coprocessor data-path

An overview of the data-path, previously introduced in [4], is presented in Fig. 2. The high degree of operation parallelism in DSP kernels is exploited by the Flexible Computational Components (FCCs). In [4] it was shown that the coprocessor efficiently accelerates kernels. The coprocessor's data-path consists of: (a) the FCCs, (b) a register bank, (c) an interconnect which enables the inter-FCC connections and the connectivity to the register bank, (d) multiplexers for providing the proper inputs to the FCCs, and (e) a control-unit. The register bank stores intermediate values among computations and input/output data located in the RAM. The control-unit manages the execution of the data-path every cycle.
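The mutually exclusive shared-memory handshake described above can be sketched as follows. All names (`SharedMemory`, `run_kernel`, `dot_kernel`) are hypothetical, chosen for illustration; the point is the protocol order: live-in scalars, start, kernel, done, live-out scalars.

```python
# Illustrative sketch (hypothetical names, not the paper's interface) of
# the start/done handshake: processor and coprocessor never run at once,
# so plain shared structures need no further synchronization.

class SharedMemory:
    def __init__(self):
        self.registers = {}   # shared slice of the coprocessor register bank
        self.ram = {}         # shared data RAM for arrays and globals

def run_kernel(shared, coprocessor_kernel, live_in):
    # Processor side: transfer live-in scalars, then assert `start`.
    shared.registers.update(live_in)
    start = True
    done = False
    # Coprocessor side: runs while the processor is stalled (exclusive).
    if start:
        coprocessor_kernel(shared)
        done = True
    # Processor side: resume on `done` and read back live-out scalars.
    assert done
    return dict(shared.registers)

def dot_kernel(shared):
    # Hypothetical kernel: dot product over arrays in the shared RAM.
    n = shared.registers["n"]
    shared.registers["acc"] = sum(shared.ram["a"][i] * shared.ram["b"][i]
                                  for i in range(n))

shared = SharedMemory()
shared.ram["a"], shared.ram["b"] = [1, 2, 3], [4, 5, 6]
live_out = run_kernel(shared, dot_kernel, {"n": 3})
# live_out["acc"] == 32
```

The `if start:` branch stands in for hardware that waits on the start signal; in the real system both signals are wires into the control-unit.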
The data-path begins the execution of a kernel when the start signal is asserted in the control-unit by the host microprocessor. When the kernel execution is completed, the control-unit informs the host processor using the done signal.

Figure 2. Overview of the proposed coprocessor.

The FCC's internal architecture is shown in Fig. 3a. The data-width of the FCC is 16 bits, although higher bit-widths are supported. It consists of four Processing Elements (PEs), four primary inputs (in1, in2, in3, in4) connected to the register bank, and two primary outputs (out3, out4) connected to the register bank. Four additional inputs (A, B, C, D) and two outputs (out1, out2) are connected either to the register bank or to another FCC. As each PE performs a two-operand operation, multiplexers are used to select the inputs for the second-level PEs. These multiplexers also create the flexible intra-FCC connections. In each PE there is an ALU and a multiplier unit, both implemented as combinational circuits. At each control-step (c-step) of the schedule, either the multiplier or the ALU is activated. The ALU performs shifting, arithmetic (add/subtract), and logical operations. The flexible connections among the PEs inside an FCC allow easily realizing any desired operation combination, such as the ones proposed in [1]-[3], by properly configuring the multiplexers of the FCC. Examples of complex operations realized by an FCC are shown in Fig. 3b. Thus, since an FCC can implement templates by properly setting its internal connections, high performance can be achieved. In [4], it was shown that an average execution time reduction of 17% was

accomplished with FCC-based data-paths relative to existing high-performance data-paths. This improvement is due to the exploitation of the chaining of operations inside the FCCs (intra-component chaining) and of inter-FCC chaining owing to the direct connections among the FCCs.

Figure 3. (a) Architecture of the FCC, (b) Examples of complex operations realized by the FCC.

IV. DESIGN FLOW

For implementing a complete application on the generic system of Fig. 1, a design flow is required that integrates the synthesis method of the FCC-based coprocessor. The design flow used to realize applications in this work is shown in Fig. 4. Initially, a kernel identification procedure, based on profiling, outputs the kernels and the non-critical parts of the source code. For performing profiling, standard debugger/simulator tools of the development environment of a specific processor can be utilized. For ARM processors, the instruction-set simulator (ISS) of the ARM RealView Developer Suite (RVDS) can be used. Kernels are considered those code segments that contribute more than a certain amount to the total application's execution time on the processor. For example, parts of the code that account for 10% or more of the application's time can be characterized as kernels. The non-critical code is compiled using a compiler for the specific processor and the software binary is produced. The kernels are synthesized using the procedure described in section IV-A. From the data-path architectures of all the kernels we derive a data-path that allows the hardware sharing among the kernels without degrading the execution time of each kernel. This sharing is feasible since the kernels are not executed concurrently. The control-unit of the multi-kernel coprocessor data-path activates the execution of a specific kernel each time.
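The kernel identification step above reduces to a simple threshold filter over profiling results. The sketch below shows that filter; the profile numbers are invented for illustration and are not from the paper.

```python
# Sketch of profiling-based kernel identification: segments that account
# for at least `threshold` of the application's execution time on the
# processor become kernels; everything else stays in software.

def identify_kernels(profile, threshold=0.10):
    """profile: {segment_name: fraction_of_total_execution_time}."""
    kernels = {s: f for s, f in profile.items() if f >= threshold}
    noncritical = {s: f for s, f in profile.items() if f < threshold}
    return kernels, noncritical

# Hypothetical profile of an encoder (fractions need not sum to 1.0):
profile = {"loop_dct": 0.38, "loop_quant": 0.21, "huffman": 0.12,
           "init": 0.04, "io": 0.05}
kernels, sw = identify_kernels(profile)
# kernels: loop_dct, loop_quant, huffman (each >= 10% of runtime)
```

The 10% threshold mirrors the paper's choice; section V notes that lowering it yields only marginal additional improvement.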
Figure 4. System design flow.

The performance of the system is estimated by cycle-level simulation that has as inputs the execution times of the kernels on the coprocessor hardware and the execution times of the rest of the code on the microprocessor. The execution cycles of the kernels are reported by the synthesis method, and the execution cycles of software on the microprocessor are extracted using an instruction-set simulator. For estimating the area of the generated coprocessor, the data-path architecture is described in synthesizable register-transfer level (RTL) VHDL. The produced VHDL is synthesized with a commercial tool, like Synplify ASIC or Synopsys Design Compiler, to estimate the area. The dark grey boxes in Fig. 4 represent the procedures modified or created by the authors for the specific flow, while the light grey boxes represent external tools.

A. Synthesis method

The flow for synthesizing a kernel described in C to the proposed coprocessor data-path, minimizing its execution time under given resource constraints, is illustrated in Fig. 5. First, the CDFG of the input kernel is created utilizing the SUIF2/MachineSUIF compiler infrastructures [9]. In this work, we utilize a hierarchical CDFG [10] for modeling data and control-flow dependencies. The control-flow structures, like branches and loops, are modeled through the hierarchy, while the data dependencies are modeled by Data Flow Graphs (DFGs). Existing and custom-made compiler passes are used for the CDFG creation. Afterwards, optimizations are applied to the kernel's CDFG for more efficient synthesis. Optimizations implemented in the synthesis methodology are tree-height reduction, dead code elimination, common sub-expression elimination and constant propagation.
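To make one of the listed optimizations concrete, the sketch below balances a chain of associative additions (tree-height reduction): re-associating ((a+b)+c)+d into (a+b)+(c+d) shortens the DFG's critical path, exposing parallelism for the PEs available in each schedule step. This is an illustrative toy, not the actual SUIF/MachineSUIF pass.

```python
# Tree-height reduction sketch: balance an associative operation chain.
# A left-leaning chain over n operands has depth n-1; a balanced tree
# has depth ceil(log2(n)), so more operations can issue in parallel.

def reduce_tree(operands):
    """Balance an 'add' chain; returns (expression_tree, depth)."""
    if len(operands) == 1:
        return operands[0], 0
    mid = len(operands) // 2
    left, dl = reduce_tree(operands[:mid])
    right, dr = reduce_tree(operands[mid:])
    return ("add", left, right), 1 + max(dl, dr)

tree, depth = reduce_tree(["a", "b", "c", "d", "e", "f", "g", "h"])
# depth == 3 (log2 of 8), versus 7 for the left-leaning chain
```

Dead code elimination, common sub-expression elimination and constant propagation are standard DFG rewrites and are omitted here for brevity.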
MachineSUIF compiler passes have been developed for the automatic application of the described optimizations on a kernel's CDFG.

Figure 5. Coprocessor synthesis method.

The optimized CDFG is input to the scheduler developed for the data-path. If the kernel's CDFG is composed of more than one DFG, the scheduler hierarchically traverses the CDFG and schedules one DFG at a time. The scheduling is a resource-constrained problem with the goal of execution cycle minimization, since the number and type of FCCs (e.g. three FCCs) in the data-path is an input to the synthesis script. A list-based (priority) scheduler has been developed. The priority function [10] of the scheduler is derived by properly labeling the DFG nodes (operations). In particular, the nodes are labeled with the weights of their longest paths to the sink node of the DFG, and they are

ranked in decreasing order. The most urgent operations are scheduled first. The resource constraints for the scheduler are determined by the total number of PEs in the first rows of all the FCCs in the data-path. If there are p FCCs in the data-path, there are 2p PEs in the first rows, since each row consists of 2 PEs. Thus, 2p primitive operations (ALU operations and/or multiplications) can be executed in parallel at each clock cycle of the schedule. For example, if there are three FCCs in the data-path, six operations can be executed in parallel at every cycle of the schedule. The input to the binding step is the scheduled CDFG. The binding algorithm maps the CDFG operations row-wise to the FCCs. Idle units inside FCCs are removed by a procedure called instantiation. In particular, when a unit (ALU or multiplier) in a PE and/or a whole FCC is not used at any control-step of the scheduled CDFG, it is not included in the final data-path. A detailed description of the binding algorithm is given in [4]. After the binding, the execution cycles of the kernel are output. The clock period of the synthesized kernel is set to the delay of the instantiated FCC with the longest combinational delay, so that all the instantiated FCCs have unit execution delay. The delay of an FCC-based data-path is largely determined by the critical path of an FCC resource, since the proposed data-path can be considered a resource-dominated circuit, as it targets DSP kernels [10]. After the binding to FCCs, the data-path specification procedure takes place. The size of the register bank is defined by the longest lifetime of all the values produced by the FCCs. The number and type of multiplexers, the interconnections among the FCCs, the interconnection of the FCCs to the registers, and the states of the control-unit are also determined. A prototype tool was developed in C for the automation of the scheduling, binding and specification procedures.

V. EXPERIMENTS
A. Experimental set-up

Two 16-bit FCC-based data-paths are considered in the experiments. The first coprocessor data-path (FCC1) is composed of two FCCs, while the second one (FCC2) of three FCCs. Each of these data-paths serves as a coprocessor to a 32-bit ARM processor. Three ARM processors are considered: (a) an ARM7 clocked at 133 MHz, (b) an ARM9 clocked at 250 MHz, and (c) an ARM10 with a clock frequency of 325 MHz. These clock frequencies were taken from reference designs on the ARM website and are considered typical for these processors in a 130nm process. In [4], we synthesized and laid out an RTL VHDL description of the FCC unit with the Synplify ASIC tool using a 130nm CMOS process, and it was found that the delay of the FCC equals 4.03ns. For accommodating the extra delays caused by the register bank, the multiplexers, the interconnect and the control-unit, we set the clock period for both data-paths to 5ns. Thus, a clock frequency of 200MHz is assumed for having unit execution delay for the FCCs. The extra delay overheads were estimated by synthesizing and laying out representative benchmarks, such as DFGs from [4] and kernels extracted from the applications of this work. The eight real-world DSP applications, described in C, used in the experiments are: a JPEG encoder, an IEEE 802.11a OFDM transmitter, a wavelet-based image compressor, a medical imaging technique called cavity detector, an image edge detector, a JPEG decoder, a GSM speech encoder and a GSM speech decoder. The ARM RVDS (version 2.2) was used for estimating the execution cycles of the software parts. The profiling results showed that each application is composed of at most 4 kernels. The number of kernels for each application is illustrated in Fig. 7. Parts of the code that account for 10% or more of the application's time were characterized as critical. It was observed that a threshold smaller than 10% leads to marginal additional improvements.
These kernels are innermost loops and they consist of word-level operations (ALU operations, multiplications, shifts) that match the granularity of the ALU and the multiplier units inside an FCC.

B. Results

The execution times and the overall application speedups for the eight applications are presented in Table I. The performance of the applications executed on the six systems is estimated via simulation, using the proposed design flow. Time_sw represents the software execution time of the whole application on a specific microprocessor (Proc.). The ideal speedup (Ideal Sp.) is the application speedup that would ideally be achieved, according to Amdahl's Law, if the application's kernels were executed on the FCC data-path in zero time. Time_system corresponds to the execution time of the application when the critical code is executed on the FCC data-path. All execution times are normalized to the software execution times on the ARM7. Sp. is the estimated application speedup, after utilizing the developed design flow, over the execution of the application on the microprocessor. The estimated speedup is calculated as Sp = Time_sw / Time_system. The average values, as well as the geometric means, of the execution times and of the speedups are also illustrated. From the results given in Table I, it is evident that significant overall performance improvements are achieved when critical software parts are synthesized on the FCCs. It is noticed from Table I that the largest overall application performance gains are achieved for the ARM7-extended architectures, since the ARM7 exhibits the highest Cycles Per Instruction (CPI) and has the slowest clock relative to the other two ARM processors. The average application speedup of the eight DSP benchmarks for the ARM7-extended systems (for both FCC1 and FCC2) is 2.90, for the ARM9 it is 2.68, while for the ARM10 systems it is 2.52.
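The speedup bookkeeping behind Table I can be re-derived in a few lines: with a fraction k of the software time spent in kernels and an acceleration factor a for those kernels on the coprocessor, Amdahl's Law gives both the estimated and the ideal (a approaching infinity) speedups. The numbers below are illustrative, not taken from the table.

```python
# Amdahl's-Law sketch of the Table I quantities. kernel_fraction is the
# share of Time_sw spent in kernels; kernel_accel is how much faster the
# kernels run on the coprocessor than in software.

def system_speedup(kernel_fraction, kernel_accel):
    """Sp = Time_sw / Time_system with kernels accelerated by kernel_accel."""
    time_system = (1 - kernel_fraction) + kernel_fraction / kernel_accel
    return 1.0 / time_system

def ideal_speedup(kernel_fraction):
    """Amdahl bound: kernels execute in zero time on the coprocessor."""
    return 1.0 / (1 - kernel_fraction)

# E.g., kernels take 75% of software time and run 12x faster on the FCCs:
sp = system_speedup(0.75, 12.0)   # 1 / (0.25 + 0.0625) = 3.2
bound = ideal_speedup(0.75)       # 4.0
```

This also explains why the measured speedups sit just under the Ideal Sp. column: the non-critical code keeps running on the microprocessor and caps the achievable gain.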
Thus, even when the FCC-based data-paths are coupled with a modern embedded processor, such as the ARM10, which is clocked at a higher frequency, the application speedup over the execution on the processor core is significant. For the case of synthesizing the kernels on data-paths including three FCCs (the FCC2 case), the speedups are somewhat larger than for the FCC1-based data-paths, due to the larger number of FCCs available at each control-step of the schedule. However, even though the kernels are executed faster on FCC2, the application speedup increases only slightly, since the non-critical code segments are still executed on the microprocessor. The average estimated application speedup is 2.67 for the microprocessor architectures coupled with the FCC1 data-paths. When the processor cores are extended with the FCC2-based coprocessors, the average speedup, for the eight applications and the three ARM processors, is 2.72. From Table I, it is inferred that the reported speedups for each application and for each processor type are close to the theoretical speedup bounds, especially for the case of the ARM7 systems. Thus, the proposed design flow quite effectively utilized the processing capabilities of the FCC-based data-paths for improving the overall performance of the applications, near to the ideal speedups. We note that experimentation with the benchmarks of this paper showed that few parts of each application can be executed in parallel on the processor and on the FCC data-path. A trivial performance increase relative to the mutually exclusive execution was also measured. Such minor improvements cannot offset the benefits of the simpler programming of the system architecture due to the exclusive execution.

TABLE I. COMPARISON OF EXECUTION TIMES FOR SOFTWARE AND SOFTWARE WITH FCC-BASED COPROCESSOR

Application   Proc.   Time_sw  Ideal Sp.   FCC1: Time_system  Sp.    FCC2: Time_system  Sp.
JPEG enc.     ARM7    1.000    3.96        0.272              3.68   0.270              3.70
              ARM9    0.461    3.24        0.157              2.94   0.155              2.97
              ARM10   0.301    3.16        0.111              2.71   0.109              2.76
OFDM trans.   ARM7    1.000    3.54        0.308              3.25   0.306              3.27
              ARM9    0.485    3.43        0.157              3.09   0.155              3.13
              ARM10   0.344    3.23        0.122              2.82   0.120              2.87
Compressor    ARM7    1.000    2.51        0.454              2.20   0.448              2.23
              ARM9    0.424    2.32        0.211              2.01   0.206              2.06
              ARM10   0.283    2.21        0.161              1.76   0.156              1.81
Cavity det.   ARM7    1.000    2.38        0.494              2.02   0.491              2.04
              ARM9    0.480    2.29        0.258              1.86   0.255              1.88
              ARM10   0.355    2.17        0.203              1.75   0.200              1.78
Edge det.     ARM7    1.000    2.61        0.409              2.44   0.406              2.46
              ARM9    0.498    2.54        0.213              2.34   0.210              2.37
              ARM10   0.367    2.49        0.162              2.27   0.159              2.31
JPEG dec.     ARM7    1.000    4.17        0.258              3.88   0.253              3.95
              ARM9    0.418    3.85        0.118              3.54   0.114              3.67
              ARM10   0.273    3.64        0.084              3.25   0.079              3.46
GSM enc.      ARM7    1.000    3.05        0.352              2.84   0.349              2.87
              ARM9    0.426    2.93        0.157              2.71   0.153              2.78
              ARM10   0.295    2.88        0.113              2.61   0.109              2.71
GSM dec.      ARM7    1.000    2.82        0.365              2.74   0.364              2.75
              ARM9    0.422    2.77        0.157              2.69   0.156              2.71
              ARM10   0.292    2.74        0.111              2.63   0.109              2.68
Average                        2.96                           2.67                      2.72
Geo. mean                      2.90                           2.61                      2.65

In order to provide an insight into the cost of coupling an FCC data-path with a microprocessor, we note that the area at 130nm for the ARM7 is 2.4mm^2, for the ARM9 3.2mm^2, and for the ARM10 6.9mm^2. The maximum area for the FCC data-paths is reported for the OFDM transmitter in the FCC2 case and equals 0.471mm^2. Hence, important speedups have been achieved, by using the FCC data-path as a coprocessor, with a relatively small area overhead.

VI. CONCLUSIONS

The speedups from executing eight DSP applications on a SoC that integrates a high-performance coprocessor were presented. The coprocessor uses flexible arithmetic units that can realize complex operations. The application speedups have an average value of 2.72 for six instances of a generic system. These improvements come with a small increase in the system's area.

ACKNOWLEDGMENTS

This work was partially funded by the Alexander S. Onassis Public Benefit Foundation.

REFERENCES

[1] M. R. Corazao et al., "Performance Optimization Using Template Mapping for Datapath-Intensive High-Level Synthesis", IEEE Trans. on CAD, vol. 15, no. 2, pp. 877-888, August 1996.
[2] J. Cong et al., "Application-Specific Instruction Generation for Configurable Processor Architectures", in Proc. of ACM FPGA '04, pp. 183-189, 2004.
[3] R. Kastner et al., "Instruction Generation for Hybrid Reconfigurable Systems", ACM TODAES, vol. 7, no. 4, pp. 605-627, October 2002.
[4] M. D. Galanis, G. Theodoridis, S. Tragoudas, C. E. Goutis, "A High Performance Data-Path for Synthesizing DSP Kernels", to appear in IEEE Trans. on CAD.
[5] R. Schreiber et al., "PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators", Journal of VLSI Signal Processing, Springer, vol. 31, no. 2, pp. 127-142, 2002.
[6] S. L. Shee et al., "Novel Architecture for Loop Acceleration: A Case Study", in Proc. of CODES+ISSS '05, pp. 297-302, 2005.
[7] T. J. Callahan et al., "The Garp Architecture and C Compiler", IEEE Computer, vol. 33, no. 4, pp. 62-69, April 2000.
[8] G. Stitt et al., "Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode", in Proc. of CODES+ISSS '05, pp. 285-290, 2005.
[9] SUIF2, http://suif.stanford.edu/suif/suif2/index.html, 2005.
[10] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.