Design methodology for programmable video signal processors

Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts
Princeton University, Department of Electrical Engineering
Engineering Quadrangle, Princeton, New Jersey

ABSTRACT

This paper presents a design methodology for a high-performance, programmable video signal processor (VSP). The proposed methodology explores both technology-driven hardware tradeoffs and application-driven architectural tradeoffs to optimize cost and performance within a class of processor architectures. In particular, it allows these competing factors to be considered concurrently at different levels of design sophistication, from early design exploration to full processor simulation. We present the results of applying this methodology to an aggressive Very-Long-Instruction-Word (VLIW) video signal processor design and discuss its utility for other programmable signal processor designs.

Keywords: processor design, video signal processing, multimedia architecture, VLIW, video compression

1. INTRODUCTION

The extraordinary growth of the multimedia industry has generated considerable demand for digital video in today's applications. General-purpose processors and traditional DSPs cannot efficiently support the intense computational demands of digital video, so video signal processors (VSPs) have been developed to meet this need. Dedicated VSPs are now available for compression as well as other specialized video applications. Dedicated VSPs, however, cannot keep pace with the rapid evolution of new and existing video applications. Consequently, the need for greater functionality, together with increasing cost and time-to-market pressures, will push the video industry toward programmable video signal processors. Programmable VSPs offer the necessary flexibility for supporting multimedia applications at lower cost and higher performance than workstation microprocessor architectures.

Unlike microprocessor architecture design, the design of programmable VSPs is an immature field. Typical microprocessor design uses existing processor designs, CAD tools, and benchmarks to provide a starting point in the design process. Existing processor designs provide an architectural reference point from which modifications can be made. Then, using the application characteristics defined by the benchmarks and simulating potential designs with the CAD tools, it is possible to readily evaluate the merits of various architectural tradeoffs. With programmable VSPs, this is not the case. There are no design tools, no benchmarks, and only a small number of existing VSPs, few of which are programmable. This is a very limited foundation from which to begin a new design. Approaching the problem of programmable VSP design, or the design of processors in any new architectural domain, therefore requires a different strategy. One such strategy is our proposed design methodology.

The most accurate method of architectural assessment involves circuit-level timing simulation of a full processor layout and cycle-level simulation of full applications based on optimized, compiled code. Unfortunately, developing such a simulation model is extremely expensive and can only be done for a tiny subset of the design space. We require a more practical design strategy that obtains results over a larger design space.
We have devised such a design methodology, one that incorporates early exploration of both technology-driven design parameters, such as circuit performance, and instruction-level behavior. Its primary advantage is that it allows these competing factors to be considered concurrently throughout the design process, from early exploration to full processor simulation. Section 2 describes the proposed methodology in detail. Section 3 presents an example of how it has been applied to the design of an aggressive Very-Long-Instruction-Word (VLIW) video signal processor. Finally, Section 4 draws conclusions about the effectiveness of the method.

2. DESIGN METHODOLOGY

Our design methodology rests on an early exploration phase that generates the largest possible design space in which to search for the best candidate architectures. Early exploration provides an approximate model of the final design on which more detailed evaluation can then be performed. The primary steps in the initial exploration phase are:

1. Choose an architectural paradigm.
2. Perform detailed, parameterizable, transistor-level design of key modules. Use the resulting area and performance data to define a unique design space, construct candidate architectures based on module area and performance costs, and estimate the cycle time.
3. Evaluate the candidate architectures by hand scheduling key VSP kernels onto each architecture using a variety of well-known compilation strategies.

The results of these experiments refine the design space and direct the implementation of the compiler. Within the reduced design space, more detailed simulation is then performed on a smaller set of designs using a wide array of applications, a prototype compiler, and a full architectural simulator driven by real application data. This iterative sequence of progressively narrower and more detailed evaluations eventually yields a final programmable video signal processor design.

2.1. Architectural paradigm

The first step in the design of the programmable VSP is the choice of the basic processor framework. A number of architectural paradigms are available, but we are primarily interested in architectures that are easily programmable from a high-level language (HLL). Potential choices include superscalar, VLIW, and multiprocessor DSP. Superscalar architectures are generally unsuitable because they are not optimized for the defining characteristics of signal processing: high data throughput and real-time constraints. VLIW and multiprocessor DSP are the common choices in existing programmable VSPs [1, 2, 3, 4].

For aggressive architectures with high degrees of parallelism and deep sub-micron manufacturing processes, multiprocessor DSP architectures are likely to be less suitable because of their large memory-bandwidth requirements. The basic multiply-accumulate architecture of a DSP requires two memory reads and one memory write per cycle, i.e., three memory ports per functional unit; for multiple parallel functional units this leads to poor speed and area. Multiprocessor DSP architectures do, however, benefit from a mature existing compiler technology, and many early programmable video signal processor designs are based on this framework for that reason [1].

VLIW architectures are a popular alternative because of the high degree of parallelism and the high clock rates made possible by static scheduling. Because a VLIW machine is a load/store architecture, it places a smaller demand on memory than the multiprocessor approach. The primary drawback of the VLIW architecture is that it lacks the existing technology base of current DSP architectures. For both VLIW and multiprocessor architectures, high degrees of parallelism and fast clock rates make area and speed a concern: performance degrades as the number of ports on memories and register files grows, and the cost of pipeline bypassing increases.
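To make the port-count argument concrete, consider a generic fixed-point FIR inner loop written in C. This is a hypothetical sketch for illustration only, not code taken from our kernel set:

/* Hypothetical FIR inner loop, shown only to illustrate where the
 * memory traffic of a multiply-accumulate datapath comes from: each
 * MAC consumes two operand loads (x[] and h[]), and results stream
 * back to memory through stores. Keeping such a datapath busy every
 * cycle is what leads to the two-reads-plus-one-write-per-cycle
 * (roughly three ports per functional unit) requirement noted above. */
void fir(short *y, const short *x, const short *h, int taps, int n)
{
    for (int i = 0; i < n; i++) {
        int acc = 0;
        for (int t = 0; t < taps; t++)
            acc += x[i + t] * h[t];   /* two loads + one multiply-accumulate */
        y[i] = (short)(acc >> 15);    /* one store per output (Q15 scaling assumed) */
    }
}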
However, much of this cost can be minimized by paying careful attention to VLSI design and, if necessary, by using a distributed architecture with a high-speed interconnect, as will be seen with our example VLIW VSP in the next section.

2.2. Architectural assessment

After choosing the basic processor framework, an analysis is needed to determine the potential performance of the processor. Performance is limited primarily by the tradeoff between speed and area: speed determines how fast the processor modules will run, and area determines the number and type of modules that can fit on the chip. To meet the intense performance requirements of video signal processing, the design must achieve high clock rates while keeping area reasonable so that more on-chip resources can be provided. These tradeoffs could, in principle, be determined by detailed VLSI simulation of the entire processor design.

Unfortunately, full processor simulation is not practical during the initial stages of design. A more reasonable approach is to perform detailed transistor-level design of only the key parameterizable modules. This is considerably less expensive, yet still provides sufficient information for early design exploration. For a programmable VSP, the key modules include the high-bandwidth interconnect, high-connectivity register files, and high-speed local memories. Other important modules may include speed-critical functional units such as multipliers and dividers. This detailed design is particularly important when implementing an aggressive architecture in an advanced manufacturing process.

Area and performance data from these designs define a unique design space for the processor in a given implementation process. This design space provides critical information about technology-driven hardware tradeoffs and their effects on area and speed. Given target speed and area constraints, these data can be used to determine the maximum size and number of ports for memories and register files, the length of the pipeline and degree of bypassing, the size of the interconnect network, the number, type, and arrangement of functional units, and numerous other aspects of the design. Based on these module cost and performance data, the system designer constructs a number of potential architectures. These candidate architectures are then used in an application assessment to determine their utility for typical programmable VSP applications. Later, an automated design exploration tool based on analytical performance models is used to suggest additional interesting candidates for evaluation.

2.3. Application assessment

Once candidate architectures are available, key VSP kernels are hand scheduled onto these architectures using a variety of compilation strategies. This serves a two-fold purpose. First, it provides early performance estimates for each candidate architecture and an early indication of any potential architectural bottlenecks. Second, having skilled system designers perform the scheduling by hand allows us to assess the effectiveness of known compilation strategies on the proposed architectures; scheduling with a compiler can conceal insights that hand-coding reveals. While hand scheduling does not produce optimal results, a first-generation compiler is likely to have many inefficiencies of its own and to be no more accurate than hand-coding.

Unfortunately, as noted above, there is no standard set of VSP benchmarks, so we must devise our own set of VSP applications for scheduling. It is also impractical to hand-code full applications across numerous candidate architectures and numerous compilation methods, particularly before an optimizing compiler is available. Instead, we take a more modest approach and schedule only key VSP kernels. While kernels do not exactly reflect the characteristics of full applications, kernels dominate signal processing code to an even greater degree than general-purpose code, so we can expect reasonably accurate results. The choice of kernels for our first design was dictated primarily by the current trend in video applications toward compression and decompression.
The primary routines in these functions are motion estimation, the discrete cosine transform (DCT), variable bit rate (VBR) coding, and color-space transformation. While there are numerous other video kernels besides these four, together they exhibit many of the characteristics of video applications and are thus believed to provide a reasonable kernel set for early exploration. The choice of kernels and applications can be revised later, during more detailed evaluation of the processor architecture.

The early performance results from these hand-coding experiments are used to guide the implementation of the compiler. Coupled with the detailed VLSI area and performance data, they also refine the design space and target a more limited range of designs. Within the new design space, more detailed simulation is performed with a wide range of applications, a prototype compiler, and a full architectural simulator driven by real application data in order to determine the final processor architecture.
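To give a flavor of the kernels involved, the following C sketch shows a full-search motion-estimation loop for a single 16x16 macroblock. It is a simplified, hypothetical version for exposition, not the exact code used in our scheduling experiments:

/* Simplified full-search motion estimation for one 16x16 macroblock.
 * For every candidate vector in a +/-R search window it accumulates a
 * sum of absolute differences (SAD) and keeps the best candidate. The
 * caller is assumed to pass pointers such that all referenced pixels
 * are in bounds. */
#include <limits.h>
#include <stdlib.h>

void full_search(const unsigned char *cur, const unsigned char *ref,
                 int stride, int R, int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    for (int dy = -R; dy <= R; dy++) {
        for (int dx = -R; dx <= R; dx++) {
            int sad = 0;
            for (int y = 0; y < 16; y++)
                for (int x = 0; x < 16; x++)
                    sad += abs(cur[y * stride + x] -
                               ref[(y + dy) * stride + (x + dx)]);
            if (sad < best) {          /* keep the best candidate vector */
                best = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}

The regular, data-parallel inner SAD loop, dominated by loads, subtractions, and absolute values rather than multiplies, is typical of why motion estimation stresses memory bandwidth far more than the multipliers.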

3. TARGET PROCESSOR

This methodology is being used to design a real processor. The design targets tens of operations per cycle at clock rates in excess of 500 MHz. To enable such high speeds and to facilitate a high degree of parallelism, we have chosen a VLIW architecture and a 0.25µ manufacturing process. VLIW architectures have only recently become feasible on a single chip, and many architectural structures are therefore possible for the first time [5], creating an even more challenging implementation problem. We show below how our design methodology deals with the resulting design issues.

3.1. Architectural assessment

As specified by the design methodology, detailed transistor-level simulation follows the adoption of the basic processor framework. Using a 0.25µ process, we have produced detailed designs of parameterizable versions of the key modules, including a high-bandwidth interconnect, a high-connectivity register file, and a high-speed local memory [6, 7]. In the case of the register file and local memory, a single global register file and a single global memory simply cannot feed the functional units at such high speeds; to achieve speeds on the order of 500 MHz, register files and memories must be quite limited in both size and number of ports. We therefore selected a distributed cluster architecture, similar to that of Labrousse et al. [8], in which each cluster has its own register file and local memory and a high-speed interconnect ties the clusters together.

Since a VLIW processor is statically scheduled, the burden of efficiently scheduling code falls on the compiler. Architectural tradeoffs can, however, be made to ease the compiler's job; one common improvement is to provide higher connectivity among units within the processor. In this case, it is therefore desirable to make the clusters as large as possible and to include full pipeline bypassing, while still staying within acceptable area and performance constraints. From the detailed module designs, we found that clusters of up to 4 issue slots were feasible, with register files of up to 256 registers and single-ported local memories of up to 32 KB, while still maintaining a clock rate of 650 MHz. The processor could support up to 8 such clusters, with all issue slots connected by a single-cycle 32x32 crossbar interconnect. If we are willing to sacrifice connectivity for higher speed, we can instead obtain an 850 MHz machine with 16 clusters of 2 issue slots each, local memories of up to 16 KB, and register files of 64 registers. In this case, however, only a 16x16 crossbar can be supported, so only one slot per cluster connects to the crossbar.

We did not perform detailed designs of the functional units, but external sources [9, 10] report the performance of similar functional units in a 0.25µ process. From these we anticipate supporting 4-stage pipelines with full-function ALUs, 16-bit shifters, 8x8 multipliers, and load/store units with simple addressing modes (direct or register-indirect). Moving to a 5-stage pipeline would allow more complex addressing modes (indexed and base-displacement) as well as a 16x16 multiplier, at the cost of single-cycle load-use and multiply-use hazards.
These combined data on potential cluster sizes and arrangements and on pipeline lengths and characteristics become the design space for this programmable VSP, illustrating how detailed transistor-level simulation of key hardware modules is invaluable in defining a processor's design space. We identified seven likely candidate architectures based on three criteria from the design space. Of the seven, four used the higher-connectivity 8-cluster/4-issue-slot model, while the other three used the 16-cluster/2-issue-slot model. Three of the architectures used 4-stage pipelines and the other four used 5-stage pipelines. Finally, five of the models used 8x8 multipliers and the other two used 16x16 multipliers. More details on the candidate architectures may be found in Wolfe et al. [11]. A design-space exploration tool is also currently being tested to find other important design-space criteria and identify additional likely candidates.
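The candidate-construction step can be viewed as a simple enumeration over the module-level parameters reported above. The C sketch below is purely illustrative: the seven candidates were hand-picked rather than taken from a full cross product, and the actual exploration tool uses analytical performance models rather than a raw peak-ops screen. The sizing and clock figures are the ones quoted in the text:

/* Toy enumeration of the design space described above: two cluster
 * organizations from the module designs, two pipeline depths, and two
 * multiplier widths. Peak ops/s is used only as a crude screen; this
 * is not the actual design-space exploration tool. */
#include <stdio.h>

struct cluster_cfg { int clusters, slots, regs, local_kb, clock_mhz; };

int main(void)
{
    const struct cluster_cfg org[2] = {
        {  8, 4, 256, 32, 650 },   /* high-connectivity organization */
        { 16, 2,  64, 16, 850 },   /* high-clock-rate organization   */
    };
    const int pipe_stages[2] = { 4, 5 };   /* 5 stages: indexed addressing, 16x16 multiplier */
    const int mult_bits[2]   = { 8, 16 };

    for (int o = 0; o < 2; o++)
        for (int p = 0; p < 2; p++)
            for (int m = 0; m < 2; m++) {
                /* per the text, the 16x16 multiplier requires the 5-stage pipeline */
                if (mult_bits[m] == 16 && pipe_stages[p] == 4)
                    continue;
                int slots = org[o].clusters * org[o].slots;
                printf("%2d clusters x %d slots, %d-stage pipe, %dx%d mult: %.1f Gops peak\n",
                       org[o].clusters, org[o].slots, pipe_stages[p],
                       mult_bits[m], mult_bits[m],
                       slots * org[o].clock_mhz / 1000.0);
            }
    return 0;
}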

3.2. Application assessment

The seven candidate architectures drawn from the design space were then programmed with six kernels from common compression standards (e.g., JPEG, MPEG): full-search motion estimation, three-step-search motion estimation, traditional 2-D discrete cosine transform (DCT), row-column DCT, variable bit rate (VBR) coding, and RGB to 4:2:0 YCrCb color-space transformation. Each of these kernels has different application characteristics, allowing us to examine different aspects of the candidate architectures. For example, motion estimation requires considerable memory bandwidth, the DCT places a heavy load on the multipliers, and VBR coding has long dependency chains and limited parallelism.

These kernels were hand scheduled using a variety of compiler strategies. Initially, each kernel was scheduled onto each architecture sequentially (i.e., using only one of the 32 available slots) with only traditional scalar optimizations. This provides a reference point for comparison with the more aggressive parallel compilation methods. Where feasible, a sequential implementation with one level of unrolling was also produced, providing a better reference for the primary parallel compilation method, software pipelining. Other compilation methods included SIMD scheduling across clusters, predication, list scheduling, increasing degrees of unrolling, blocking, and the addition of special operations.

The results showed that all the candidate architectures were relatively balanced: no model was particularly deficient in load/store units, multipliers, or shifters. However, for two of the three criteria on which the candidates were based, one model proved clearly superior. In the case of the multipliers, the 8x8 multiplier model was considerably slower than the 16x16 multiplier model. The other case was more surprising: the lower-connectivity 16-cluster/2-issue-slot model typically outperformed the 8-cluster/4-issue-slot model. This can be attributed to its additional resources (with one multiplier, load/store unit, and shifter per cluster, it has twice as many of these limited resources), so it was less likely to be limited by resources and more likely to be limited by the number of issue slots, and to its roughly 30% faster clock. However, because hand scheduling was involved, we believe the higher-connectivity model is likely to show equivalent performance once a compiler performs the scheduling. More detailed results of the kernel scheduling can be found in Wolfe et al. [11].

The results were also invaluable for evaluating the various compiler strategies. While software pipelining was the dominant parallel compilation method, list scheduling often performed nearly as well when unrolling was used, particularly with multiple levels of unrolling. The strategy that proved most valuable overall was unrolling. We often used an SIMD approach, performing the same operations on different data in different clusters, but it was also feasible to split the operations across multiple clusters to shorten the code. This allowed multiple levels of unrolling while still keeping the code small enough to fit within the instruction cache.
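To illustrate what the dominant unrolling strategy looks like at the source level, the C sketch below unrolls a luma color-transform loop by four so that the independent iteration bodies can be distributed across clusters in SIMD fashion. It is a hypothetical, source-level view only; the coefficients are a standard 8-bit luma approximation, and n is assumed to be a multiple of four:

/* Original loop: one luma sample per iteration. */
void rgb_to_y(short *y, const short *r, const short *g, const short *b, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = (77 * r[i] + 150 * g[i] + 29 * b[i]) >> 8;
}

/* Unrolled by four: the four bodies are independent, so an SIMD-across-
 * clusters schedule can place each body in a different cluster and
 * software-pipeline the result. (Assumes n is a multiple of 4.) */
void rgb_to_y_unrolled(short *y, const short *r, const short *g,
                       const short *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        y[i]     = (77 * r[i]     + 150 * g[i]     + 29 * b[i])     >> 8;
        y[i + 1] = (77 * r[i + 1] + 150 * g[i + 1] + 29 * b[i + 1]) >> 8;
        y[i + 2] = (77 * r[i + 2] + 150 * g[i + 2] + 29 * b[i + 2]) >> 8;
        y[i + 3] = (77 * r[i + 3] + 150 * g[i + 3] + 29 * b[i + 3]) >> 8;
    }
}

Higher degrees of unrolling expose more parallelism at the cost of instruction-cache footprint, which is why code size had to be watched as noted above.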
These observations will prove useful in designing the compiler. With the detailed evaluations provided by the kernel results, we are now beginning to refine the design space and prepare for its more detailed evaluation. A design exploration tool is being designed to look for further candidate architectures, design of the compiler is under way, an RTL simulator is being built to allow more accurate evaluation of programs, and more extensive research is being done to assemble a more complete set of video applications. Eventually we intend to have sufficient design tools to evaluate our architectures completely using compiled code from full production applications.

4. CONCLUSION

While the proposed design methodology offers no guarantee of finding the optimal programmable video signal processor design, even with the limited foundation from which we began we were able to arrive at a reasonably successful design. Based on our results, by using processor resources efficiently we can sustain performance exceeding 15 GOPs for long periods. Even full-search motion estimation, generally considered the most time-consuming routine in video compression, can run in real time using only 33%-46% of the compute time at CCIR-601 resolution (720x480). We have therefore achieved a processor design that not only offers performance comparable to or better than today's best dedicated VSPs, but also provides the valuable benefit of programmability.
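As a rough, back-of-envelope consistency check on these figures (our own calculation from the Section 3.1 parameters, not a measured result): 16 clusters x 2 issue slots x 0.85 GHz gives about 27 Gops peak, and 8 clusters x 4 issue slots x 0.65 GHz gives about 21 Gops peak, so sustaining 15 GOPs corresponds to roughly 55%-72% utilization of the issue slots.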

Perhaps the most valuable aspect of this design methodology is that it is not specific to programmable video signal processors. By first choosing an architectural paradigm, then performing detailed simulation of key modules, and finally hand-scheduling key application kernels onto the candidate architectures identified within the resulting design space, it should be possible to generate a design for any type of signal processor, or potentially even a general-purpose processor.

5. REFERENCES

1. R. Gove, "The MVP: A Highly-Integrated Video Compression Chip," in Proc. of the Data Compression Conf., 1994.
2. C. Hansen, "MicroUnity's MediaProcessor Architecture," IEEE Micro, Aug. 1996.
3. P. Foley, "The Mpact Media Processor Redefines the Multimedia PC," Proc. of COMPCON '96, 1996.
4. S. Rathnam and G. Slavenburg, "An Architectural Overview of the Programmable Multimedia Processor, TM-1," Proc. of COMPCON '96, 1996.
5. J. Gray, A. Naylor, A. Abnous, and N. Bagherzadeh, "VIPER: A 25-MHz 100-MIPS Peak VLIW Microprocessor," Proc. of 1993 IEEE Custom Integrated Circuits Conf., 1993.
6. S. Dutta, "VLSI Issues for Video Signal Processing," Ph.D. Thesis, Princeton University.
7. S. Dutta, A. Wolfe, W. Wolf, and K. O'Connor, "Design Issues for a Very-Long-Instruction-Word VLSI Video Signal Processor," in VLSI Signal Processing IX, Oct. 1996.
8. J. Labrousse and G. Slavenburg, "A 50 MHz Microprocessor with a VLIW Architecture," Proc. Int. Solid-State Circuits Conf., San Francisco, 1990.
9. N. Ohkubo, et al., "A 4.4ns CMOS 54x54-b Multiplier Using Pass-Transistor Multiplexer," Proc. of 1994 IEEE Custom Integrated Circuits Conf., 1994.
10. M. Suzuki, et al., "A 1.5ns, 32b CMOS ALU in Double Pass-Transistor Logic," Int. Solid-State Circuits Conf., 1993.
11. A. Wolfe, J. Fritts, S. Dutta, and E. S. Fernandes, "Datapath Design for a VLIW Video Signal Processor," to appear in Third Int. Symp. on High-Performance Computer Architecture, Jan. 1997.
