Physical Synthesis and Electrical Characterization of the IP-Core of an IEEE-754 Compliant Single Precision Floating Point Unit


Alian Engroff, Leonardo Tomazine Neto, Edson Schlosser and Alessandro Girardi
Federal University of Pampa - UNIPAMPA
Av. Tiaraju, 810 Alegrete RS Brazil
alessandro.girardi@unipampa.edu.br

ABSTRACT
This paper presents the physical design and electrical characterization of the IP-core of a single precision floating point unit (FPU) that follows the IEEE-754 standard for the binary representation of real numbers. An RS-232 serial interface was implemented for communication between the FPU and a computer for test purposes. Functional tests were performed at several operating frequencies to verify the results produced by the FPU. The power consumption was also measured and compared with the consumption estimated by the synthesis tools. Tests included addition, subtraction, multiplication and division operations, using different baud rates between the interface and the FPU.

Categories and Subject Descriptors
B.7.1 [Types and Design Styles]: Algorithms implemented in hardware IEEE 754.

General Terms
Algorithms, Measurement, Performance, Design, Verification.

Keywords
floating point unit; digital design; serial interface.

1. INTRODUCTION
Floating point operations in most computer architectures are performed by specific hardware components [1]. The component responsible for floating point arithmetic is referred to as the Floating Point Unit (FPU). In general, FPUs are incorporated into the processing unit, where they accelerate arithmetic calculations by exploiting binary arithmetic in floating point representation, providing a considerable increase in computational performance [2][3]. There are several ways to represent real numbers in the binary domain. The standard format adopted since 1985 is IEEE-754 [4], which provides the rules for portability of the binary floating point format.
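As a concrete illustration of the single precision format, the following minimal sketch splits a 32-bit word into its sign, biased exponent and fraction fields (1, 8 and 23 bits). The function names are hypothetical and the sketch uses Python's standard `struct` module; it is not part of the FPU design.

```python
import struct


def float_to_bits(x):
    """Return the 32-bit IEEE-754 single precision encoding of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def split_fields(word):
    """Split a 32-bit word into (sign, biased exponent, fraction)."""
    sign = (word >> 31) & 0x1        # 1 bit
    exponent = (word >> 23) & 0xFF   # 8 bits, bias 127
    fraction = word & 0x7FFFFF       # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction


# -6.5 = -1.101b * 2^2, so the biased exponent is 127 + 2 = 129
word = float_to_bits(-6.5)
sign, exponent, fraction = split_fields(word)
print(hex(word), sign, exponent, hex(fraction))
```

For -6.5 the stored fraction holds only the bits after the binary point (101 followed by zeros), since the leading 1 of a normalized value is implicit.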
This work presents the physical implementation of the IP-core of a fully IEEE-754 compliant 32-bit single precision floating point unit. The internal block organization and architecture were developed specifically for this work, aiming at reusability of basic internal components and at savings in area and power dissipation. The methodology used in this design was based on the IP-Process [5], adopted by the Brazil-IP program, of which this design is part. In the IP-Process methodology, verification procedures are present in each design stage. It divides the design of an IP-core into four phases: behavioral design, architecture, RTL design and prototyping. At the behavioral design stage, functional and nonfunctional requirements are listed in order to define the project scope and acceptance criteria. In the architecture phase the building blocks and connections are established, providing the basis for implementation and verification. In RTL design, the architecture is described in synthesizable blocks, with the inclusion of code for functional verification. Finally, the prototyping stage implements the design in a physical device, and electrical tests are performed. This paper presents the results of physical level synthesis, as well as electrical measurements on a prototyped chip containing a novel architecture of a single precision Floating Point Unit.

2. PROPOSED ARCHITECTURE
The architecture of the FPU consists of six main blocks: control, exception verifier, fixed point adder/subtractor, integer multiplier, integer divider and shifter. Intermediate and final data are stored in special function registers. Fig. 1 shows the schematic of the proposed architecture. A microprogrammed control was implemented, with microinstructions defining the set of signals that configure the datapath. Although the microprogram is executed sequentially, in some cases execution must branch based on intermediate calculated values.
A complete floating point operation takes multiple cycles. The FPU starts by recording the operands X and Y in the input registers. In the next cycle these values are read by the exception verifier block. If a special case representation is detected (Not-a-Number, infinity, zero, etc.), it is reported: a signal is sent to the control block, which returns an exception flag, and the transaction is completed. If no special case is identified, the desired arithmetic operation is launched. Most of the hardware is shared between operations. One of the most important blocks at this step is the fixed point adder/subtractor (Adder_sub in Fig. 1). It is responsible for adding both mantissas and exponents and for equality comparisons, and it is also used for normalization and rounding. For example, a floating point multiplication requires adding the exponents, while a floating point addition uses the same hardware to sum the mantissas. The advantage of this technique is the economy of functional units, which also means saving silicon area. The disadvantage is the increase in the number of cycles required to perform a complete floating point operation. With this implementation strategy we expect to decrease area and power consumption considerably at the expense of increased processing time. Possible applications of this floating point unit are systems-on-chip and microprocessors for non-critical processing requirements in embedded systems.
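The special case check performed by the exception verifier can be illustrated in software. The sketch below is a behavioral model only, not the authors' hardware: it classifies a 32-bit operand from its exponent and fraction fields according to the IEEE-754 encoding rules, and it shows the exponent addition used for multiplication. The function names are hypothetical, and how the FPU treats subnormal operands is not stated in the paper, so that class is returned separately.

```python
def classify(word):
    """Classify a 32-bit IEEE-754 single precision word by its encoding."""
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exponent == 0xFF:
        # All-ones exponent: infinity if the fraction is zero, NaN otherwise
        return "infinity" if fraction == 0 else "nan"
    if exponent == 0:
        # All-zeros exponent: signed zero or a subnormal number
        return "zero" if fraction == 0 else "subnormal"
    return "normal"


def product_biased_exponent(ex, ey):
    """Biased exponent of a product before normalization.

    Both inputs carry the bias of 127, so one bias is subtracted
    after the fixed point addition.
    """
    return ex + ey - 127


print(classify(0x7FC00000))               # a quiet NaN pattern
print(classify(0x80000000))               # negative zero
print(product_biased_exponent(128, 129))  # 2.0 * 4.0 -> exponent of 8.0
```

In the shared-hardware scheme described above, the addition inside `product_biased_exponent` would be one pass through the same adder/subtractor that sums mantissas for floating point addition.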

Fig. 1 Proposed architecture for the FPU.

3. SERIAL COMMUNICATION INTERFACE
For testing and prototyping purposes, an interface is needed for data communication between the FPU and a digital system such as a computer. The FPU specification calls for 68 input bits, divided between the operands (32-bit X and Y), the rounding mode (RM) and the operation (OP), so sending all bits in parallel is impractical. The same holds for the output, since 32 bits are needed to represent the result of the operation performed by the FPU. In the prototyping stage it would not be feasible to have a pin for each input or output bit of the FPU, so we opted for serial communication. With this strategy, only two pins are needed for the communication: a transmitter (TXD) and a receiver (RXD), as shown in Fig. 2.

Fig. 2 RS-232 interface between the FPU and a computer.

For simplicity we chose the RS-232 protocol, also known as EIA-232, for data communication between the devices. This protocol was developed in the 1960s by the Electronic Industries Association (EIA), which specified the voltages, functions and timings of the signals [6]. EIA-232 specifies data transfer rates, that is, the speed at which data is sent through the channel. Commonly used values are 300, 1200, 2400, 4800, 9600 and 19200bps. In the development of this communication interface, we designed an architecture that automatically calculates the data transfer rate. Data transmission is performed in groups of 8 bits. Each byte is preceded by a start bit, indicating that 8 data bits will follow, and is followed by a stop bit, indicating that the byte transmission is over. To transmit and receive data between the interface and the digital system, the data transfer rate must first be defined.
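The framing described above can be sketched in a few lines. The sketch also previews the rate measurement and mid-bit sampling that the autobaud protocol, described in the following paragraphs, performs on the synchronization character "U". It is a behavioral model under stated assumptions, not the authors' circuit: data bits are sent least significant bit first (the usual order for asynchronous serial links, not stated in the paper), the start bit is counted among the low periods, and the ratio of 104 clock cycles per bit is an illustrative value. All function names are hypothetical.

```python
def frame_byte(value):
    """Line levels for one byte: start bit (0), 8 data bits LSB first, stop bit (1)."""
    data = [(value >> i) & 1 for i in range(8)]
    return [0] + data + [1]


def sample_line(bits, cycles_per_bit):
    """Expand bit levels into one sample per internal clock cycle."""
    return [b for b in bits for _ in range(cycles_per_bit)]


def estimate_ratio(samples):
    """Estimate r from the frame of 'U' (0x55).

    Counts clock cycles while the line is low and averages over the
    low bit periods. Assumption: 5 low periods, i.e. the start bit
    plus the four 0 data bits of 0x55.
    """
    low_cycles = sum(1 for s in samples if s == 0)
    return low_cycles // 5


def sample_points(r, n_bits=8):
    """Cycle offsets from the start-bit edge: r/2 to reach the middle of the
    start bit, then r more cycles for each data bit."""
    return [r // 2 + r * (i + 1) for i in range(n_bits)]


def decode_byte(samples, r):
    """Recover a byte by sampling each data bit at the middle of its period."""
    return sum(samples[p] << i for i, p in enumerate(sample_points(r)))


sync = sample_line(frame_byte(ord("U")), 104)  # 104 cycles per bit (illustrative)
r = estimate_ratio(sync)
print(frame_byte(ord("U")), r)
print(hex(decode_byte(sample_line(frame_byte(0xA3), 104), r)))
```

With least significant bit first ordering, the frame of "U" alternates on every bit period, which gives the counter several equal low intervals to average.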
For this, we created a protocol, called autobaud, that automatically calculates the data transfer rate. Autobaud detection works as follows: before sending any data, the computer sends the ASCII code of the character "U" (binary "01010101") to the interface module. With this information, the interface calculates the data transfer rate (baud rate). Whenever a bit of the byte "01010101" is "0", the increment of a counter inside the interface module is enabled. This counter operates at the internal clock frequency. The value is accumulated and, at the end of the byte, an arithmetic average is taken to obtain the value of the transfer rate. Since the clock frequency of the interface module is higher than the transfer frequency (baud rate), the counter accumulates a number that is the ratio between those frequencies. The autobaud relation r, stored in the counter, is determined by the relation between the internal clock and the data transmission rate, as shown in eq. 1. This expression gives the number of times the transmission rate is slower than the internal clock of the interface module.

r = internal clock frequency / transmission rate (1)

Knowing this relation, the received data signal is synchronized with the internal clock by acquiring each bit at the middle of its period: the interface counts r/2 internal cycles after the beginning of the start bit and then r cycles between successive data samples. After the synchronization byte, the communication protocol continues by sending the length of the data vector, which informs how much data will be received in the internal memory of the communication interface. The prototyped device can store up to 8 values to be processed. Finally, the operation data are sent, comprising 4 bytes for each X operand, 4 bytes for each Y operand and a last byte containing the rounding mode and the arithmetic operation to be performed. For example, if the length of the operation vector is 1, a total of 11 bytes are sent in each reception procedure (one synchronization byte, one length byte and nine operation bytes). Fig. 3 shows the sequence of data transmission.

Fig. 3 Receiving data protocol.

After receiving the data, the interface module performs a serial to parallel conversion and sends the operands to the FPU inputs. The FPU operation starts automatically, and the resulting 32-bit value Z = X op Y is captured by the interface, converted to serial and transmitted back to the computer.

4. PHYSICAL SYNTHESIS AND DRC ANALYSIS
The physical synthesis design stage generates the layout of the chip, producing the mask patterns and the physical representation of the integrated circuit to be prototyped. Using the Synopsys IC Compiler tool we performed the physical synthesis flow, following the steps up to the generation of the GDSII file that contains the final design to be sent to the foundry for prototyping. The design flow of the IC Compiler tool is depicted in Fig. 4.

Fig. 4 - Design flow for physical synthesis using the IC Compiler tool.

The IC Compiler tool needs three input files for the synthesis. The first file contains the environmental conditions and constraints attached to the project, determined by the designer. This information is organized in a file with the SDC extension, obtained in the previous logic synthesis stage. The second file is the netlist generated by the logic synthesis stage, including I/O pads. Both input and output pads have an additional circuit for ESD protection. The supply voltage pins (VSS and VDD) are also included. We inserted four pairs of VDD and VSS in order to feed the rings around the pads, serving as a source for the I/O pads, and also to provide power to the core logic cells. They were inserted as a pair of voltage sources on each side of the chip. The third file contains information about the position of the I/Os around the chip. The design of the FPU described in this paper is core limited, because the data input and output are serial, demanding few pads.

Filler pads have to be inserted between the pads in order to fill the empty space and maintain the connection of the pad supply lines. To insert the exact number of filler pads, it was necessary to know the width and height of the chip, as well as the total number of pads. The FPU includes 12 input/output pads, 8 supply pads and 4 corners. The width of these pads can be seen in Table 1.

Tab. 1 - Pad widths
Pad: Corner, I/O, Filler

To adjust the width and height of the chip, it was necessary to check the original size of the chip in order to calculate the exact number of fillers needed between the pads. The figures reported by the tool can be seen in Tab. 2.

Tab. 2 - Chip and core information before filler insertion.
Core, Pad core, Chip
µm 89.6µm 11.2µm
Height
Area (mm²)


The calculation of the adjustment was made using eq. 2. The value obtained for nf must be rounded up to get the exact number of fillers needed in each space. With eq. 3 we can obtain the new width and height of the chip.

(2)

(3)

The variables are the number of fillers per side (nf), the number of pads per side (np), the number of corners per side (nc), the width or height of the chip (X), the new width or height of the chip (X'), the width of the corner (Wc), the width of the pad (Wp) and the width of the filler (Wf). We found a width of µm and a height of µm. The width and height of the chip were then changed to these values and fillers were inserted in the spaces. Table 3 presents the new dimensions of the FPU design.

Tab. 3 - Chip and core size after filler insertion
Rows: Height, Area (mm²). Columns: Core, Pad core, Chip.

The total area of the chip is about 10mm². The placement order of the pads, as well as the inclusion of power pads and fillers, defines the overall size of the chip and has a direct influence on the cost of the project.

The next step is the insertion of metal lines for the core supply. These lines are placed in the empty space between the pads and the core, in the form of metal rings, in order to distribute power to the core. Each side of the ring has a VDD and a VSS line, which are connected to the power pads. In addition to the rings, there are metal lines that cross the core vertically or horizontally. These lines, called straps, are connected to the power rings and supply the cells inside the core.

After the metal supply lines are placed, the next task is the placement of the standard cells. Here, all cells previously generated by the logic synthesis are placed into the core. The timing constraints provided by the SDC file are taken into account, so that the placement algorithm puts related cells close together, minimizing routing and delay. The required timing constraints are evaluated in this step. However, even when timing is accounted for during placement, there are effects that cannot be overcome by cell positioning alone and are analyzed in clock tree synthesis. One effect that must be analyzed is clock skew, which is characterized by the arrival of the clock signal at different times at different circuit components. One reason for this phenomenon is the large difference in the distance traveled by the clock signal between the clock pad and different cells. When this occurs, there is a timing problem in the circuit that impairs data processing. To solve the problem of clock skew, the clock tree generation creates alternative paths and inserts buffers so that the clock signal reaches the different blocks of the circuit at the same time.

After the routing process we perform the DRC and antenna analysis. For the DRC analysis we used the Synopsys Hercules tool, which validates the layout rules from the GDSII file extraction. The resulting layout, including the FPU IP-core and the serial interface, is shown in Fig. 5. It was prototyped in XFAB 0.35µm technology. The final chip microphotograph is shown in Fig. 6.

Fig. 5 Layout obtained after physical synthesis.

Fig. 6 - Microphotograph of the prototyped IP-core.
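The filler adjustment described in this section can be illustrated with a small calculation. The sketch below is one plausible reading of that step, not a transcription of eqs. 2 and 3: it assumes nf counts all fillers on one chip side, that the fillers share the length left over after corners and pads are placed, and that lengths are handled as integers (nanometres) to avoid rounding surprises. The function names and every numeric value are illustrative assumptions.

```python
def fillers_per_side(x, nc, wc, n_pads, wp, wf):
    """Number of fillers needed on one side, rounded up.

    x: current side length, nc: corners per side, wc: corner width,
    n_pads: pads per side (np in the text), wp: pad width, wf: filler width.
    All lengths are in the same integer unit.
    """
    free = x - nc * wc - n_pads * wp  # length not covered by corners and pads
    return -(-free // wf)             # integer ceiling division


def new_side_length(nc, wc, n_pads, wp, nf, wf):
    """Side length after inserting nf whole fillers."""
    return nc * wc + n_pads * wp + nf * wf


# Illustrative values in nanometres (not taken from the paper)
X, NC, WC, NP, WP, WF = 1_234_000, 2, 300_000, 5, 90_000, 10_000
nf = fillers_per_side(X, NC, WC, NP, WP, WF)
print(nf, new_side_length(NC, WC, NP, WP, nf, WF))
```

Because the number of fillers is rounded up, the adjusted side is never shorter than the original one, which is consistent with the chip dimensions being changed before the fillers are inserted.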

5. RESULTS
Having completed the physical synthesis and prototyping design stages, we present results obtained both from estimates made by the design tool and from electrical measurements. The dynamic power consumption of the entire chip was estimated at 23mW by the IC Compiler tool. Tab. 4 summarizes the estimated power budget. The measured power consumption was 14.6mW, although this value was obtained under waiting conditions (FPU idle, with the clock signal at maximum frequency). The leakage power is 96.85µW, measured with no connected input or output, no clock signal, and a supply voltage of 3.3V. The maximum estimated operating frequency was 17MHz, but practical tests indicate a maximum frequency of 15MHz. The relation between power consumption and operating frequency is almost linear, as can be seen in Fig. 7.

Tab. 4 - Estimated dynamic power consumption for the FPU.
Power consumption
Cell Internal Power: mW
Net Switching Power: 5.906mW
Cell Leakage Power: 3.147µW
Total Dynamic Power: mW

6. CONCLUSION
This paper presented the complete design flow for the development of an IP-core of a floating point unit intended for embedded devices. Using commercial tools from Synopsys, namely Design Compiler for logic synthesis, IC Compiler for physical synthesis, Hercules for DRC analysis and VCS for verification, the GDSII file was generated and the circuit was prototyped in 0.35µm technology. An RS-232 serial interface was included in the prototyped chip for sending data to and receiving data from a computer. Measurements on the prototyped chip showed a maximum operating frequency of 15MHz, against 17MHz predicted by the synthesis tool. The total chip area was approximately 10mm². We developed a new architecture for autobaud detection for an RS-232 standard interface. For reliability purposes, the autobaud calculation is very important for minimizing synchronization errors. The IP-Process methodology was adopted in this design, including verification at each design stage.
It proved to be a good approach, since the design succeeded at the first prototype run.

Fig. 7 Measurement results for power dissipation versus operation frequency.

7. ACKNOWLEDGMENTS
The grant supporting this work, provided by the CNPq Brazilian Agency by means of the Brazil-IP Initiative, is gratefully acknowledged.

8. REFERENCES
[1] R.V.K Pillai, D. Al-Khalili, A.J. Al-Khalili, A Low Power Approach to Floating Adder Design, Proceedings of the 1997 International Conference on Computer Design (ICCD '97).
[2] A. M. Nielsen, et al., An IEEE Compliant Floating Point Adder that Conforms with the Pipelined Packet-Forwarding Paradigm, IEEE Transactions on Computers, v. 49, n. 1, Jan.
[3] M. Lu, Arithmetic and Logic in Computer Systems, Wiley.
[4] IEEE Computer Society, IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std.
[5] M. S. M. Lima, F. S. D. Santos, J. F. B. Silva, E. N. S. Barros. ipprocess: A Development Process for Soft IP-core with Prototyping in FPGA. In: Forum on Specification and Design Languages (FDL), 2005, Lausanne: EPFL.
[6] Electronic Industries Association. EIA standard RS-232-C: Interface between Data Terminal Equipment and Data Communication Equipment Employing Serial Binary Data Interchange. Washington: Electronic Industries Association, Engineering Dept.
