APPLICATION SPECIFIC PROCESSORS


THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
VLSI, COMPUTER ARCHITECTURE AND DIGITAL SIGNAL PROCESSING
Consulting Editor Jonathan Allen

Other books in the series:

QUICK-TURNAROUND ASIC DESIGN IN VHDL: Core-Based Behavioral Synthesis, M.S. Romdhane, V.K. Madisetti, J.W. Hines ISBN:
ADVANCED CONCEPTS IN ADAPTIVE SIGNAL PROCESSING, W. Kenneth Jenkins, Andrew W. Hull, Jeffrey C. Strait ISBN:
SOFTWARE SYNTHESIS FROM DATAFLOW GRAPHS, Shuvra S. Bhattacharyya, Praveen K. Murthy, Edward A. Lee ISBN:
AUTOMATIC SPEECH AND SPEAKER RECOGNITION: Advanced Topics, Chin-Hui Lee, Kuldip K. Paliwal ISBN:
BINARY DECISION DIAGRAMS AND APPLICATIONS FOR VLSI CAD, Shin-ichi Minato ISBN:
ROBUSTNESS IN AUTOMATIC SPEECH RECOGNITION, Jean-Claude Junqua, Jean-Paul Haton ISBN:
HIGH-PERFORMANCE DIGITAL VLSI CIRCUIT DESIGN, Richard X. Gu, Khaled M. Sharaf, Mohamed I. Elmasry ISBN:
LOW POWER DESIGN METHODOLOGIES, Jan M. Rabaey, Massoud Pedram ISBN:
MODERN METHODS OF SPEECH PROCESSING, Ravi P. Ramachandran ISBN:
LOGIC SYNTHESIS FOR FIELD-PROGRAMMABLE GATE ARRAYS, Rajeev Murgai, Robert K. Brayton ISBN:
CODE GENERATION FOR EMBEDDED PROCESSORS, P. Marwedel, G. Goossens ISBN:
DIGITAL TIMING MACROMODELING FOR VLSI DESIGN VERIFICATION, Jeong Taek Kong, David Overhauser ISBN:
DIGIT-SERIAL COMPUTATION, Richard Hartley, Keshab K. Parhi ISBN:
FORMAL SEMANTICS FOR VHDL, Carlos Delgado Kloos, Peter T. Breuer ISBN:
ON OPTIMAL INTERCONNECTIONS FOR VLSI, Andrew B. Kahng, Gabriel Robins ISBN:

APPLICATION SPECIFIC PROCESSORS

Edited by
Earl E. Swartzlander, Jr.
University of Texas at Austin

KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London

Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts USA

Distributors for all other countries:
Kluwer Academic Publishers Group
Distribution Centre
Post Office Box AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-13: DOI: / e-ISBN-13:

Copyright © 1997 by Kluwer Academic Publishers
Softcover reprint of the hardcover 1st edition 1997

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts

Printed on acid-free paper.

TABLE OF CONTENTS

Preface

1. Variable Precision, Interval Arithmetic Processors
   Michael J. Schulte
      Introduction
      Variable-Precision, Interval Arithmetic
      Previous Research
      Processor Implementation
      Area, Delay and Execution Time Estimates
      Variable-Precision, Interval Arithmetic Algorithms
      Conclusions

2. Modeling the Power Consumption of CMOS Arithmetic Elements
   Thomas K. Callaway
      Introduction
      Previous Research
      Parallel Adders
      Parallel Multipliers
      Conclusions

3. Fault Tolerant Arithmetic
   Yuang-Ming Hsu
      Introduction
      Previous Research
      The Time Shared TMR Technique
      VLSI Designs and Performance Evaluations
      Conclusions

4. Low Power Digital Multipliers
   Edwin de Angel
      Introduction
      Related Research
      Digital Multipliers
      CMOS Multipliers
      Combinational Self-Timed Multipliers with Bypassing Logic
      Results

5. A Unified View of CORDIC Processor Design
   Shaoyun Wang and Vincenzo Piuri
      Introduction
      The CORDIC Algorithm
      Combined Architectures
      Pipelined Architectures
      Architectural Evaluation
      Design Guidelines and Conclusions

6. Multidimensional Systolic Arrays for Computing Discrete Fourier Transforms and Discrete Cosine Transforms
   Hyesook Lim
      Introduction
      Multidimensional DFT and DCT by Multidimensional Systolic Array
      Fast Fourier Transform Computation by Multidimensional Systolic Array
      Prime-Factor Decomposed Computation by Multidimensional Systolic Array
      Conclusions

7. Parallel Implementation of a Fast Third-Order Volterra Filtering Algorithm
   Hercule Kwan
      Introduction
      Volterra Filtering in the Time and Frequency Domain
      Parallel Implementation on DSPs
      Performance Evaluation
      Applications to Nonlinear Communication Channels
      Future Research

8. Design and Implementation of an Interface Control Unit for Rapid Prototyping
   Mohammad S. Khan
      Introduction
      Related Work
      Interface Control Unit
      ICU Protocol
      Hardware Design of the ICU
      Conclusions

Index

PREFACE

Application specific processors are not a new idea. For example, difference engines such as those of Muller, Babbage and the Scheutzes are excellent examples of very early digital application specific processors. These machines were designed for the efficient production of numerical tables. These are stand alone application specific processors. Such processors are very different from general purpose computers. General purpose computers sacrifice performance in order to achieve flexibility and generality. In contrast, application specific processors are optimized for their intended application, often achieving orders of magnitude improvement in performance.

Two decades ago, the minicomputer gained wide acceptance by providing economical computing with adequate performance for a wide variety of applications. Because the hardware and software programming environment design costs can be amortized over a large production run, the cost to each user is reduced quite dramatically. Of course, minicomputer based systems require the development of custom software, which is often as expensive as custom hardware.

With the development of highly automated Computer Aided Design systems for the design of VLSI circuits, the design of Application Specific Integrated Circuits (ASICs) has become an attractive way to achieve performance approaching that of custom crafted VLSI at very modest cost. It seems clear that the high level aspects of VLSI CAD technology can be applied (perhaps with higher level "system" extensions) to create "system compilers" that will greatly simplify the design and development of application specific processors. This paradigm drastically reduces the risk and cost of creating application specific processors.

APPLICATION SPECIFIC PROCESSORS

From the earliest times, most high performance signal processors have been realized with application specific processors.
The explanation is that application specific processors can be tailored to exactly match the (usually very demanding) application requirements. The result is that no "processing power" is wasted for unnecessary capabilities and maximum performance is achieved. A disadvantage is that such processors have been expensive to design since each is a unique design that is customized to the specific application.

In the last decade, computer aided design systems have been developed to facilitate the development of application specific integrated circuits. The success of such ASIC CAD systems suggests that it should be possible to streamline the process of application specific processor design.

Based on experience in developing VLSI chips and signal processing systems, I believe there are three rules that should guide the development of application specific processors: (1) use only as much arithmetic as necessary, (2) use data interconnections that match the algorithm and (3) use programmability sparingly.

The first rule is to use only as much arithmetic as necessary. In earlier application specific processors, the use of minimal arithmetic was absolutely crucial as large wordsize floating point arithmetic was prohibitively complex. The advent of VLSI has relaxed this constraint somewhat, but it remains obvious that fixed point arithmetic with small wordsize should be used if possible. The penalty of floating point arithmetic is particularly significant for addition, where floating point requires an initial alignment and a final normalization in addition to the basic addition operation.

The second rule is to use data interconnections that match the algorithm. In a general purpose computer, such as the von Neumann machine shown in Figure 1, a large memory holds most of the data and serves as the interconnection medium for the succession of arithmetic operations that are performed on data. This is an inefficient process. For example, contrast the effort required to perform an operation on two numbers using a special purpose implementation with the effort required with a general purpose processor.
For the special purpose implementation, data are latched into the input registers, the result is computed and then latched into the output register. With the general purpose processor at least four steps are required (load first data, load second data, perform the operation, and store the result). The special purpose solution avoids the need to generate three addresses and to read two data from the memory and write two data to the memory. Instead of passing data through a memory, application specific processors connect from one processor to the next as required for the algorithm. In cases where data is not in the correct order, small multi-port memories or shift register queues can be used to provide the necessary reordering.

The final rule is to use programmability sparingly. It often may seem attractive to use a fast programmable processor or a network of programmable processors to provide the necessary computational capability. Upon close examination, we generally discover that this is extremely inefficient. The parts of an algorithm that are stable and that involve fixed processing sequences could have been implemented much more efficiently with custom (non programmable) processors. The remaining portion of the algorithm which requires the flexibility of a programmable processor can be executed at much lower speeds. Thus it is attractive to use an application specific combination of fixed processors with a programmable processor. For this hybrid combination, the fixed processor provides the computational "horsepower" while the programmable processor provides the "steering."

DISCUSSION

When application specific processors offer such attractive performance, why does anyone ever use a general purpose processor? There are two reasons: (1) general purpose processor hardware and supporting software programming environment development costs can be amortized over a potentially long production run which reduces the cost to any individual user and (2) the risk of fundamental defects in the design is eliminated (at least for all users after the first!) Of course, the second point indicates that only the first user gets state of the art performance.

The problem remains that general purpose processors generally offer inadequate throughput for many "interesting" problems. The "general purpose" supercomputers that do achieve high throughputs are not true general purpose machines and are extremely expensive (in cost, power consumption, size, etc.) relative to their performance. The software development process introduces an additional cost factor that is often overlooked in comparisons of application specific versus general purpose processors.

One solution to the low throughput attainable with a general purpose processor is to use a large number of them. Connecting a general purpose host processor to an array of processing elements provides reasonably high throughput for many problems.
This "semi-application specific" approach has an advantage in that the existing host computer programming environment can be used, which simplifies the programming task. The host processor communicates with an interface that provides data buffering and control to provide data (via a communication network) to the processor array and to capture data from it. A wide variety of regular geometries have been used as required for specific classes of applications. The processors used in the arrays range from Transputers (advanced computers with flexible communication interfaces) to specialized single bit processing elements. The more advanced processors like the Transputer offer a well developed programming environment which facilitates the development of programs for the processing elements.

Problems with the semi-application specific array processors include: (1) the difficulty of writing software for an array of processors that interact with each other, (2) the inefficiency of parallel processing, (3) the hardware complexity of the data communication network, and (4) the high complexity of coherently sharing global data amongst a multiplicity of processing elements. It may be noted that most parallel processing systems do not achieve a speed-up commensurate with the number of processors.

In the future, the remarkable advances in productivity achieved for VLSI circuits with advanced CAD and silicon compilers will be extended to application specific computing systems. Such synthesis or "system compilation" will significantly automate the design process for application specific processors. Specifically the system compiler will handle the design of networks of processors where both the tailored network and the processor design are optimized for the specific application. A hierarchy of simulators will be used to verify the performance and to confirm correctness of the highly interactive software. Is this chip level CAD, system compilation or both?

OUTLINE OF THE BOOK

This book consists of eight chapters which provide a mixture of techniques and examples that relate to application specific processing. The inclusion of techniques is expected to suggest additional research and to assist those who are faced with the requirement to implement efficient application specific processors. The examples illustrate the application of the concepts and demonstrate the efficiency that can be achieved via application specific processors. The chapters were written by members and former members of the application specific processing group at the University of Texas at Austin. The first five chapters relate to specialized arithmetic which often is the key to achieving high performance in application specific processors.
The next two chapters focus on signal processing systems, and the final chapter examines the interconnection of possibly disparate elements to create systems.

The first chapter, "Variable-Precision, Interval Arithmetic Processors" is by Michael J. Schulte. This chapter presents the design of a processor that efficiently implements interval arithmetic. Here data values are represented by the endpoints of intervals that contain the correct value. As data is processed, the intervals lengthen until they are so wide that no information is provided about the data. Thus part of the attraction of interval arithmetic is that it provides a built in accuracy monitor for all data. Until this work interval arithmetic has been orders of magnitude slower than conventional arithmetic, so it has not been widely used.

The second chapter is "Modeling the Power Consumption of CMOS Arithmetic Elements" by Thomas K. Callaway. It compares a number of standard adder and multiplier circuits in terms of their area, delay and the average number of gate transitions that each requires per arithmetic operation. For static CMOS circuits the power consumption is approximately proportional to the number of gate transitions. Thus the average number of gate transitions gives a good approximate estimate of the power consumption.

Next is "Fault Tolerant Arithmetic" by Yuang-Ming Hsu. This chapter surveys techniques that facilitate the construction of fault tolerant arithmetic processors. Techniques in the areas of hardware redundancy, information redundancy and time redundancy are all considered. The time redundant techniques offer attractive performance with a modest complexity overhead in comparison to standard arithmetic.

Chapter 4 is "Low Power Digital Multipliers" by Edwin de Angel. It presents several techniques at the algorithm and circuit level that can be employed individually or in combination to substantially reduce the power required by an array implementation of a radix-4 Booth multiplier. Most of the techniques have the effect of increasing the speed.

The fifth chapter is "A Unified View of CORDIC Processor Design" by Shaoyun Wang and Vincenzo Piuri. This chapter examines the well known CORDIC algorithm for evaluating sines, cosines and other trigonometric functions. It shows that many of the iterations can be performed in parallel.

The next chapter is "Multidimensional Systolic Arrays for Computing Discrete Fourier Transforms and Discrete Cosine Transforms" by Hyesook Lim. This work concerns the combination of two semi-systolic arrays to produce a systolic system.
This work is a good example of situations where humans have the advantage over even the best CAD systems. The fundamental idea presented here is the result of insight that is not likely to be automated in the foreseeable future.

The seventh chapter is "Parallel Implementation of a Fast Third-Order Volterra Filtering Algorithm" by Hercule Kwan. Volterra filters are of great theoretical value in analyzing phenomena in diverse fields ranging from ocean wave shapes to multipath distortion. The computational loads are extremely high, suggesting the need to consider application specific processing implementations. This chapter reports on some of the early work in developing a multiprocessor implementation of Volterra filters.
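
To give a feel for why the computational load of a Volterra filter is so high, the following is a minimal sketch of the textbook direct time-domain form of a third-order Volterra filter. It is not the fast algorithm of the chapter, and it does not exploit kernel symmetry. The function names, the kernel arguments `h1`, `h2`, `h3`, and the choice of treating samples before time zero as zero are all assumptions made for illustration.

```python
from itertools import product


def volterra3_sample(x, n, h1, h2, h3):
    """Return (y, mults): the output sample y(n) and the multiplications used.

    Hypothetical helper: h1, h2, h3 are the first-, second- and third-order
    kernels stored as nested lists with N entries per dimension.
    """
    N = len(h1)  # memory length (number of taps), an illustrative parameter
    # Delayed inputs x(n), x(n-1), ..., x(n-N+1); samples before time 0 are zero.
    d = [x[n - i] if n - i >= 0 else 0.0 for i in range(N)]
    y = 0.0
    mults = 0
    # First-order (linear) term: N kernel entries, one multiplication each.
    for i in range(N):
        y += h1[i] * d[i]
        mults += 1
    # Second-order term: N^2 kernel entries, two multiplications each.
    for i, j in product(range(N), repeat=2):
        y += h2[i][j] * d[i] * d[j]
        mults += 2
    # Third-order term: N^3 kernel entries, three multiplications each.
    for i, j, k in product(range(N), repeat=3):
        y += h3[i][j][k] * d[i] * d[j] * d[k]
        mults += 3
    return y, mults


def direct_mult_count(N):
    """Multiplications per output sample for the direct form above."""
    return N + 2 * N**2 + 3 * N**3
```

With this naive form the work per output sample grows with the cube of the memory length: `direct_mult_count(16)` evaluates to 12816 multiplications for every sample, which suggests why fast algorithms and multiprocessor implementations are of interest.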

The final chapter is "Design and Implementation of an Interface Control Unit for Rapid Prototyping" by Mohammad S. Khan. This work addresses the need to interconnect multiple processors to implement a system. The processors can all be of the same type (homogeneous) or can be a variety of specialized processors (heterogeneous). Having a generic processor interface is expected to greatly simplify the development of large application specific processing systems.

Earl E. Swartzlander, Jr.
Austin, Texas
