Logic Optimization Techniques for Multiplexers

Jennifer Stephenson, Applications Engineering
Paul Metgen, Software Engineering
Altera Corporation

1 Abstract

To drive down the cost of today's highly complex FPGA designs, designers are looking to fit the most logic and features into the smallest FPGA device. A study of 100 designs showed that multiplexers accounted for 26% of the logic element utilization. This paper explores how synthesis tools such as Mentor Graphics Precision RTL Synthesis can infer different types of multiplexers from different styles of HDL code, and how these structures can be mapped into FPGA devices. Inefficient multiplexers can greatly increase the logic required to implement your design. This paper discusses some common pitfalls and provides design guidelines to achieve optimal resource utilization for multiplexer designs in 4-input look-up table (LUT) based architectures such as Altera's Stratix™ device family.

2 Motivation for Multiplexer Optimization

Altera analyzed the synthesis results for 100 customer benchmark designs and found that multiplexers (muxes) accounted for an average of 26% of the logic element utilization. This result indicates that focusing on multiplexer optimization could significantly affect the overall logic utilization for many designs. By optimizing muxes, designers may be able to reduce cost by using a smaller device.

Synthesis tools such as Precision RTL Synthesis optimize the designer's source Verilog or VHDL code for both logic utilization and performance. However, the best optimizations sometimes require human knowledge of the design, and synthesis tools cannot always know the design intent. Designers are often in the best position to improve their quality of results.

3 Types of Multiplexers

This section discusses how multiplexers are created from various types of HDL code. Case statements, if statements and state machines are all common sources of multiplexing logic in designs.
These HDL structures can create different types of multiplexers: binary multiplexers, selector multiplexers and priority multiplexers. Understanding how multiplexers arise from HDL code and how they might be implemented during synthesis is the first step towards optimizing the structures for best results.

3.1 Binary Multiplexers

Binary multiplexers select inputs based on binary-encoded select bits. Figure 1 shows a Verilog example that describes a simple 4:1 binary mux.

    case (sel)
        2'b00: z = a;
        2'b01: z = b;
        2'b10: z = c;
        2'b11: z = d;
    endcase

Figure 1: Simple Binary-Encoded Case Statement

The case choices do not have to cover the full binary range as they do in the previous example. Synthesis tools might also choose to implement more complicated structures using binary multiplexers. The VHDL example in Figure 2 is illustrated schematically as a binary mux in Figure 3.

    CASE sel IS   -- sel is 4 bits wide, sel[3:0]
        WHEN "0101" => z <= a;
        WHEN "0111" => z <= b;
        WHEN "1010" => z <= c;
        WHEN OTHERS => z <= d;
    END CASE;

Figure 2: Case Statement with More Complex Encoding

Figure 3: Binary Multiplexer Implementation of the Case Statement in Figure 2 (schematic not reproduced; labels: sel[1:0], sel[3:2], branches 00xx, 01xx, 10xx and 11xx, data inputs a, b, c and d, binary mux)
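For readers working in Verilog, the sparse encoding of Figure 2 could be written roughly as follows. This is a minimal sketch: the module name and port list are hypothetical, and only the three case choices and the default come from Figure 2.

```verilog
// Hypothetical wrapper module; only the case encoding is taken from Figure 2.
module mux_sparse_case (
    input  wire [3:0] sel,
    input  wire       a, b, c, d,
    output reg        z
);
    // Combinational logic: three explicit choices, everything else selects d.
    always @* begin
        case (sel)
            4'b0101: z = a;
            4'b0111: z = b;
            4'b1010: z = c;
            default: z = d;   // plays the role of WHEN OTHERS
        endcase
    end
endmodule
```

Whether a synthesis tool maps this to the binary structure of Figure 3 or to some other structure depends on the tool and its settings.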

3.2 Selector Multiplexers

Selector multiplexers have a separate select line for each of their data inputs. The select lines for the mux are essentially one-hot encoded. Figure 4 shows a simple Verilog example that describes a one-hot selector mux.

    case (sel)
        4'b0001: z = a;
        4'b0010: z = b;
        4'b0100: z = c;
        4'b1000: z = d;
        default: z = 1'bx;
    endcase

Figure 4: Simple One-Hot-Encoded Case Statement

Synthesis tools can also choose to implement case statements using selector multiplexers. For example, the schematic in Figure 5 shows how the VHDL code in Figure 2 can be implemented as a selector multiplexer instead of a binary multiplexer. The AND gate on the right-hand side of the figure implements the others or default case (d input) by detecting the situation when all the other cases are false, or inactive. Synthesis tools decide which type of multiplexer to implement based on their own algorithms, and different tools may provide different solutions for different types of HDL source code.

Figure 5: Selector Multiplexer Implementation of the Case Statement in Figure 2 (schematic not reproduced; sel[3:0] is compared against 0101, 0111 and 1010 to select inputs a, b and c, with d as the default input of the selector mux)

3.3 Priority Multiplexers

In priority multiplexers, the select logic implies a priority, so the options to select the correct item must be checked in order. These structures commonly arise from if-else, "when-else", or "? :" statements in VHDL or Verilog. The example VHDL code in Figure 6 is likely to result in the implementation illustrated schematically in Figure 7.

    IF cond1 THEN
        z <= a;
    ELSIF cond2 THEN
        z <= b;
    ELSIF cond3 THEN
        z <= c;
    ELSE
        z <= d;
    END IF;

Figure 6: If Statement Implying Priority

Notice that the multiplexers form a chain, evaluating each condition, or select bit, one at a time.
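The same priority chain can be sketched in Verilog with the "? :" operator. The module and port names below are hypothetical and mirror the signal names of Figure 6. Because the conditional operator is right-associative, cond1 is tested first, then cond2, then cond3.

```verilog
// Hypothetical module: a Verilog rendering of the priority chain in Figure 6.
module priority_mux (
    input  wire cond1, cond2, cond3,
    input  wire a, b, c, d,
    output wire z
);
    // Each condition is checked only if all earlier conditions are false.
    assign z = cond1 ? a :
               cond2 ? b :
               cond3 ? c :
                       d;
endmodule
```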
Figure 7: Priority Multiplexer Implementation of the If Statement in Figure 6 (schematic not reproduced; a chain of 2:1 muxes controlled by cond1, cond2 and cond3, with data inputs a, b, c and d)

4 Implementing Multiplexers in 4-Input Look-Up Tables

This section discusses how the three styles of multiplexers described in the previous section can be implemented in the 4-input look-up tables (LUTs) found in many FPGA architectures, such as Altera's Stratix devices. Synthesis tools perform this mapping automatically, but understanding it enables you to use a coding style that may be mapped more efficiently.

4.1 Binary Multiplexers

A 4:1 binary multiplexer can be implemented very efficiently using just two 4-input LUTs, and larger multiplexers can be built using this structure.

4.1.1 Efficient 4:1 Binary Multiplexers

A 4:1 multiplexer can be implemented in two 4-input LUTs, as illustrated in Figures 8 and 9. Figure 8 shows the configuration when the most significant select line S1 is set to 0. In this case, the select line S0 controls the right-hand LUT in the figure, and chooses either the C or D input to feed through both LUTs to the output. Figure 9 shows the configuration when S1 is set to 1. In this case, S0 is fed through to control the left-hand LUT, so S0 chooses between the A and B inputs.

Figure 8: 4:1 Binary Multiplexer in two LUTs, S1=0 (schematic not reproduced; inputs A, B, C, D and select S0)

Figure 9: 4:1 Binary Multiplexer in two LUTs, S1=1 (schematic not reproduced; inputs A, B, C, D and select S0)

4.1.2 Building Larger Binary Multiplexers

One technique for building mux trees is to use a basic 2:1 mux as a building block. However, in such a scheme, each 2:1 mux requires a separate LUT. Implementing an N-input multiplexer (N:1 mux) using this scheme requires at least N - 1 LUTs.

Larger binary muxes can be constructed more efficiently using the 4:1 mux presented in section 4.1.1. Constructing an N:1 mux from a tree of 4:1 muxes results in a structure that uses as few as 0.66*(N - 1) LUTs. For example, a 16:1 mux built from five 4:1 muxes uses ten LUTs.

4.2 Selector Multiplexers

Selector multiplexers are commonly built as a tree of AND and OR gates, as shown in Figure 5. Using this scheme, two inputs can be selected, using two select lines, in a single LUT that implements two AND gates and an OR gate. The outputs of these LUTs can be combined using a wide OR gate. An N-input selector multiplexer of this structure requires at least 0.66*(N - 0.5) LUTs, which is slightly worse than the best binary multiplexer.

4.3 Priority Multiplexers

Large priority multiplexers look like a chain of 2:1 muxes, like the example in Figure 7. An N-input priority mux uses a LUT for every 2:1 mux in the chain, requiring N - 1 LUTs, or roughly N. In addition, this chain of multiplexers is generally bad for delay, since the critical path through the logic traverses every multiplexer in the chain.

Avoid priority muxes where priority is not required. If the order of the choices is not important to the design, use a case statement to implement a binary or selector mux instead of the priority mux.

If priority is required, there are alternate implementations of priority multiplexers that may improve the delay through the logic. The logic structure in Figure 10 uses just slightly more LUTs than the standard priority mux scheme, but significantly improves the delay through the logic.

Figure 10: Priority Multiplexer Optimized for Delay (schematic not reproduced; data inputs d0 to d7, first-level muxes controlled by sel0, sel2, sel4 and sel6, and AND gates on the remaining select lines controlling the later levels)

In this structure, if any of the select lines sel0-sel3 are high, then the 4-input AND gate chooses the left-hand half of the logic; otherwise it chooses the right-hand half. The 2-input AND gates perform a similar function to choose one of the first level of muxes, then sel0, sel2, sel4, or sel6 makes the final choice of inputs. The signal sel0 has the highest priority in the figure, meaning it represents the first If statement in the HDL source code.
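As a rough illustration of this kind of restructuring, the following Verilog sketch recodes an 8-input priority mux as a balanced tree. The module name, the port names, and the use of OR-reductions to form the group selects are assumptions made for illustration. The gate-level details are not taken from Figure 10, which is described as using AND gates.

```verilog
// Hypothetical delay-oriented recoding of an 8-input priority mux.
// Intended behavior: sel[0] wins over sel[1], which wins over sel[2], and
// so on; d[7] is the final "else" input when no select line is high.
module priority_mux8_tree (
    input  wire [6:0] sel,
    input  wire [7:0] d,
    output wire       z
);
    // First level: the even-numbered select lines choose within each pair.
    wire m0 = sel[0] ? d[0] : d[1];
    wire m1 = sel[2] ? d[2] : d[3];
    wire m2 = sel[4] ? d[4] : d[5];
    wire m3 = sel[6] ? d[6] : d[7];

    // Second level: a pair wins if either of its select lines is high.
    wire left  = (sel[0] | sel[1]) ? m0 : m1;
    wire right = (sel[4] | sel[5]) ? m2 : m3;

    // Final level: the left half wins if any of sel[3:0] is high.
    assign z = (|sel[3:0]) ? left : right;
endmodule
```

The data path here passes through three mux levels instead of the seven of a straight chain. A synthesis tool may or may not preserve this structure, so it is worth checking the reported logic depth after compilation.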

The delay optimizations that synthesis tools perform on priority multiplexers vary by tool and depend on the structure of the design. If delay is important in a priority multiplexing design, consider recoding the design to ensure a scheme that reduces the number of levels of logic.

5 Design Guidelines to Avoid Common Pitfalls

This section investigates several common pitfalls in multiplexer design, and provides design guidelines to avoid these pitfalls. By taking care when coding your design, you can achieve better logic utilization.

5.1 Default or Others Case Assignment

To fully specify the cases in a case statement, you need to include a default (Verilog) or others (VHDL) assignment. This assignment is especially important in one-hot encoding schemes where many combinations of the select lines are not used. Specifying a case for the unused select line combinations tells the synthesis tool how to deal with these cases. VHDL requires a case statement to cover every possible choice. Verilog does not, but a combinational case statement that leaves some combinations unassigned implies a latch.

Some designs do not have a requirement for the unused cases, often because it is assumed that these cases will not arise. In these situations, you can choose any value for the default or others assignment. However, be aware that the assignment value you choose can have a large effect on the logic utilization required to implement the design, because synthesis tools treat different values for the assignment in different ways.

5.1.1 Example: Precision RTL Synthesis 2003.72 Results for a 4:1 Selector Mux

The effect that the default or others case assignment can have on synthesis results is best illustrated with an example. The results in this section were generated for Altera's Stratix architecture using Precision RTL Synthesis version 2003.72, but most synthesis tools will show a similar difference in results.

In the simple 4:1 selector multiplexer design shown in Figure 4, the designer has created a default assignment to X, or don't care.
Note that in the Stratix architecture, this 4:1 mux design could be implemented in three LEs, but that optimization is not available in this version of Precision RTL. Compiled in Precision RTL, this design requires four LUTs and thus uses four logic elements (LEs) in the Stratix device.

Figure 11 shows a modified version of the code, where the designer has assigned cases for inputs a, b, and c, but then made the default assignment choose input d in all other cases, since the only other case of interest should choose input d. This design actually requires seven LEs when compiled in Precision RTL! This is 75% more LEs than the example where X was assigned as the default value.

    case (sel)
        4'b0001: z = a;
        4'b0010: z = b;
        4'b0100: z = c;
        default: z = d;
    endcase

Figure 11: One-Hot-Encoded Case Statement with d as Default Case

Listing case d explicitly as in Figure 4, while also assigning the default case to input d, gives the same result as the code in Figure 11. Choosing any of the other inputs as the default would also give the same utilization result.

Since there is no valid assignment for the invalid cases, some designers may set the default case so that the output is unchanged and keeps its previous value, as shown in Figure 12. This type of assignment requires more logic, because the synthesis tool has to implement feedback from the output back into the multiplexer. This design takes eight LEs to implement in Precision RTL.

    case (sel)
        4'b0001: z = a;
        4'b0010: z = b;
        4'b0100: z = c;
        4'b1000: z = d;
        default: z = z;
    endcase

Figure 12: One-Hot-Encoded Case Statement with z as Default Case

These three pieces of code perform the same function for the valid combinations of the select lines, yet the difference in logic utilization is huge! To obtain best results, explicitly define your invalid case selections with a separate default or others statement, instead of combining the invalid cases with one of the defined cases.
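The recommended style can be sketched as a complete Verilog module for a bus of data. The module name, port names, and the WIDTH parameter are hypothetical. The case encoding follows Figure 4, with the don't-care default widened to the bus width.

```verilog
// Hypothetical one-hot selector mux with an explicit don't-care default.
module selector_mux #(
    parameter WIDTH = 32          // assumed bus width, for illustration only
) (
    input  wire [3:0]       sel,  // expected to be one-hot
    input  wire [WIDTH-1:0] a, b, c, d,
    output reg  [WIDTH-1:0] z
);
    always @* begin
        case (sel)
            4'b0001: z = a;
            4'b0010: z = b;
            4'b0100: z = c;
            4'b1000: z = d;
            // Invalid select combinations: any output value is acceptable.
            default: z = {WIDTH{1'bx}};
        endcase
    end
endmodule
```

In simulation the X value propagates to z whenever sel is not one-hot, which also makes unexpected select combinations easy to spot. The LE count that results depends on the synthesis tool and version.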
If you do not care about the value in the invalid cases, say so explicitly by assigning the X logic value for these cases instead of choosing another value.

The difference in logic utilization in these different cases is due either to the decode logic, or to inefficiencies in the per-bit multiplexing cost. Synthesis tools may give more efficient results when the multiplexers select bus inputs. In this example, if a, b, c, d, and z are each 32-bit buses, each coding style results in approximately the same number of LEs (101, 100, and 105 LEs respectively) in Precision RTL. These results are much better optimized than the single-bit result, using as few as 3.125 Stratix LEs per bit of the bus (100 LEs for 32 bits), much closer to the optimal result of three LEs noted above for this 4:1 mux. This result indicates that the per-bit cost is the same for all three schemes (three LEs/bit) and that the differences in logic utilization are due to the decoding logic.

If you are concerned about area utilization for your multiplexers, examine your synthesis results to ensure you are getting the expected logic utilization. Different synthesis tools (and versions) may give different results due to various speed and area optimizations in the tools, so knowledge of the optimal result for a given design can be very powerful.

5.2 Implicit Defaults

The If statements in Verilog and VHDL are a convenient way of specifying conditions that don't easily lend themselves to a case-type approach. However, these statements can result in complicated multiplexer trees that are not easy for synthesis tools to optimize. In particular, every If statement has an implicit Else case, even if it is not specified. These implicit defaults can cause additional complexity in a multiplexing design.

The code sample in Figure 13 appears to represent a 4:1 multiplexer; there are four inputs (a, b, c, d) and one output (z).

    IF cond1 THEN
        IF cond2 THEN
            z <= a;
        END IF;
    ELSIF cond3 THEN
        IF cond4 THEN
            z <= b;
        ELSIF cond5 THEN
            z <= c;
        END IF;
    ELSIF cond6 THEN
        z <= d;
    END IF;

Figure 13: If Statement with Implicit Defaults

However, each of the three separate If statements in the code has an implicit Else condition that is not specified. Since the output values for the Else cases are not specified, the synthesis tool has to assume the intent is to maintain the same output value for these cases. Figure 14 shows code with the same functionality as Figure 13 but specifies the Else cases explicitly.
    IF cond1 THEN
        IF cond2 THEN
            z <= a;
        ELSE
            z <= z;
        END IF;
    ELSIF cond3 THEN
        IF cond4 THEN
            z <= b;
        ELSIF cond5 THEN
            z <= c;
        ELSE
            z <= z;
        END IF;
    ELSIF cond6 THEN
        z <= d;
    ELSE
        z <= z;
    END IF;

Figure 14: If Statement with Default Conditions Explicitly Specified

Figure 15 is a schematic representation of the above code, illustrating that although there are only four inputs, the multiplexing logic is significantly more complicated than a basic 4:1 mux.

Figure 15: Multiplexer Implementation of the If Statement in Figure 13 and Figure 14 (schematic not reproduced; 2:1 muxes controlled by cond1 through cond6, with data inputs a, b, c and d and feedback from the output)

You can do several things in these cases to simplify the multiplexing logic and remove the unneeded defaults. The best approach may be to recode the design so it takes the structure of a 4:1 case statement. Alternatively, or if the priority is important, you can restructure the code to reduce the default cases and flatten the multiplexer. In this example, instead of IF cond1 THEN IF cond2, use IF (cond1 AND cond2), which performs the same function whenever the implicit hold case is not needed. In addition, question whether the defaults are don't care cases. In this example, you can promote the last ELSIF cond6 statement to an ELSE statement if no other valid cases can occur.

Avoid unnecessary default conditions in your multiplexer logic to reduce the complexity and the logic utilization required to implement your design.
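A flattened version of the Figure 13 logic might look like the following Verilog sketch. The module and port names are hypothetical. It assumes that every combination in which the original code would hold the previous value of z is a don't care, and that cond6 can be promoted to the final else. If those assumptions do not hold for a design, this code is not equivalent to Figure 13.

```verilog
// Hypothetical flattened rewrite of the nested If statement in Figure 13.
// Assumption: the implicit "hold" cases never occur or are don't cares,
// so no feedback from z is needed and the last branch can be a plain else.
module flat_mux (
    input  wire cond1, cond2, cond3, cond4, cond5,
    input  wire a, b, c, d,
    output reg  z
);
    always @* begin
        if (cond1 && cond2)
            z = a;
        else if (cond3 && cond4)
            z = b;
        else if (cond3 && cond5)
            z = c;
        else
            z = d;   // promoted from "ELSIF cond6"
    end
endmodule
```

Every path assigns z, so the block is purely combinational with no implied storage. Priority among the remaining conditions is still preserved.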

5.3 Degenerate Multiplexers

A degenerate multiplexer is one in which not all of the possible cases are used for unique data inputs. The unneeded cases tend to contribute to inefficiency in the logic utilization for these multiplexers. You can recode degenerate muxes so that they take advantage of the efficient logic utilization possible with full binary muxes.

The number of select lines in a binary multiplexer normally dictates how big a mux is needed to implement the desired function. For example, the mux structure represented in Figure 3 has four select lines and could implement a binary multiplexer with 16 inputs. However, the figure does not use all 16 inputs and thus is considered a degenerate 16:1 mux.

According to the results in section 4.1.2, a 16:1 binary mux can be implemented in ten 4-input LUTs. Most synthesis tools, though, can perform local optimizations on degenerate muxes that look at each mux individually and improve the logic utilization. In this example, the first and fourth muxes in the top level can easily be eliminated since all four inputs to each mux are the same value, and the number of inputs to the other multiplexers can be reduced, as shown in Figure 16.

Figure 16: Optimized Version of the Degenerate Binary Multiplexer from Figure 3 (schematic not reproduced; sel[1:0] controls two 3:1 muxes on the 01xx and 10xx branches, the 00xx and 11xx branches pass d directly, and sel[3:2] controls the final mux)

Implementing this version of the multiplexer still requires at least five LUTs in total: two for each of the 3:1 muxes and one for the 2:1 mux. This design selects an output from only four inputs, and from section 4.1.1 a 4:1 binary mux can be implemented optimally in two LUTs, so this degenerate multiplexer tree is reducing the efficiency of the logic.

You can improve the logic utilization of this type of structure by recoding the select lines to implement a full 4:1 binary mux. Figure 17 provides code for a recoder design that translates the original select lines into a signal z_sel with binary encoding, and Figure 18 provides code to implement the full binary mux.

    CASE sel IS   -- sel is 4 bits wide, sel[3:0]
        WHEN "0101" => z_sel <= "00";
        WHEN "0111" => z_sel <= "01";
        WHEN "1010" => z_sel <= "10";
        WHEN OTHERS => z_sel <= "11";
    END CASE;

Figure 17: Recoder Design for Degenerate Binary Multiplexer

    CASE z_sel IS   -- z_sel is 2 bits wide, z_sel[1:0]
        WHEN "00" => z <= a;
        WHEN "01" => z <= b;
        WHEN "10" => z <= c;
        WHEN "11" => z <= d;
    END CASE;

Figure 18: 4:1 Binary Multiplexer Design

You can use the new z_sel control signal from the recoder to control the 4:1 binary multiplexer that chooses between the four inputs a, b, c, and d, as illustrated in Figure 19. The complexity of the select lines is handled in the recoder, and the data multiplexing is performed with simple binary select lines to enable the most efficient implementation.

Figure 19: 4:1 Binary Multiplexer with Recoder (schematic not reproduced; sel[3:0] feeds the recoder, whose z_sel[1:0] output controls a 4:1 mux with data inputs a, b, c and d)

The recoder design can be implemented in two LUTs and the efficient 4:1 binary mux uses two LUTs, for a total of four LUTs. The original degenerate mux required five LUTs, so the recoded version uses 20% less logic than the original.

You can often improve the logic utilization of multiplexers by recoding the select lines into full binary cases. Although logic is required to do the encoding, more logic may be saved in the data multiplexing.

5.4 Buses of Multiplexers

Multiplexer inputs are often buses, with the same multiplexing function applied to every bit of the bus. In these cases, any inefficiency in the multiplexer is multiplied across every bit of the bus. The issues described in the previous sections become even more important for wide mux buses.
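The recoder of Figure 17 and the 4:1 binary mux of Figure 18 could be combined for a bus of data along the following lines. This is a Verilog sketch with a hypothetical module name and an assumed WIDTH parameter. The select encodings are taken from Figures 17 and 18.

```verilog
// Hypothetical bus-wide mux that shares one recoder across all data bits.
module recoded_bus_mux #(
    parameter WIDTH = 32          // assumed bus width, for illustration only
) (
    input  wire [3:0]       sel,
    input  wire [WIDTH-1:0] a, b, c, d,
    output reg  [WIDTH-1:0] z
);
    reg [1:0] z_sel;

    // Recoder (Figure 17): built once, independent of the bus width.
    always @* begin
        case (sel)
            4'b0101: z_sel = 2'b00;
            4'b0111: z_sel = 2'b01;
            4'b1010: z_sel = 2'b10;
            default: z_sel = 2'b11;
        endcase
    end

    // Full 4:1 binary mux (Figure 18): replicated for every bit of the bus.
    always @* begin
        case (z_sel)
            2'b00:   z = a;
            2'b01:   z = b;
            2'b10:   z = c;
            default: z = d;   // z_sel == 2'b11
        endcase
    end
endmodule
```

Only the second always block scales with WIDTH, which is where the savings for wide buses come from. A synthesis tool is free to merge the two blocks again, so confirm the resulting LUT count in the synthesis report.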

The recoding technique discussed in the previous section can often be applied to buses of multiplexers. Recoding the select lines may only need to be done once for all the multiplexers in the bus. By sharing the recoder logic among all the bits in the bus, you can greatly improve the logic efficiency of a bus of muxes.

The degenerate multiplexer in section 5.3 requires five LUTs to implement. If the inputs and output are 32 bits wide, the function could require 32 x 5, or 160, LUTs for the whole bus. The recoder uses only two LUTs, and the select lines only need to be recoded once for the entire bus. The binary 4:1 mux requires two LUTs per bit of the bus. The total logic utilization for the recoded version could be 2 + 2 x 32, or 66, LUTs for the whole bus, as compared to 160 LUTs for the original version! The savings in logic become much more obvious when the mux works across wide buses.

Using techniques to optimize degenerate muxes, removing unneeded implicit defaults, and choosing the optimal default or others case can play an important role when optimizing buses of multiplexers.

6 Conclusion

Logic utilization is an important cost factor in FPGA designs, and designers can use logic optimization to reduce the logic required to implement their designs. Synthesis tools optimize Verilog or VHDL code for both logic utilization and performance, but in some cases designers, who know the original design intent, are in the best position to improve their quality of results. Multiplexing logic takes up a large portion of the typical FPGA design, and inefficient multiplexers can greatly increase the logic required to implement your design.
To optimize the resource utilization for mux structures, it is important to understand how multiplexers arise from HDL code, and how they might be implemented in the target device. Use the techniques discussed in this paper to choose the optimal default or others case for your case statements, avoid unnecessary default conditions in your if statements, and optimize degenerate muxes to allow the most efficient multiplexer implementation. If your design multiplexes buses of data, these techniques are even more important. Armed with an in-depth knowledge of multiplexer implementation, you can optimize your design to ensure that it achieves the optimal logic utilization.

Altera Corporation, 101 Innovation Drive, San Jose, CA 95134, (408) 544-7000, http://www.altera.com
Copyright 2006 Altera Corporation. All rights reserved.