Instruction Set Extensions for Cyclic Redundancy Check on a Multithreaded Processor
Emily R. Blem, Dept. of ECE, University of Wisconsin-Madison, Madison, WI (blem@cae.wisc.edu)
Suman Mamidi, Dept. of ECE, University of Wisconsin-Madison, Madison, WI (mamidi@cae.wisc.edu)
Michael J. Schulte, Dept. of ECE, University of Wisconsin-Madison, Madison, WI (schulte@engr.wisc.edu)

ABSTRACT
Cyclic redundancy check (CRC) algorithms are widely used for error detection in wireless communication systems. CRC is a simple algorithm, but implementations on conventional processors are inefficient because the algorithm is serial and based on bit-wise operations. In this paper, we explore several instruction set extensions to the Sandbridge multithreaded processor for CRC. The performance speedup of each extension is evaluated using the Sandbridge software tools, and the area and delay of the corresponding hardware are presented. The instruction set extensions produce performance gains of up to 23.0x for the CRC kernel.

1. INTRODUCTION
Cyclic redundancy check (CRC) computations are used in a variety of applications, especially when information transmission or reception is involved. Wireless communication standards that use CRC include Bluetooth, WiMAX, WCDMA, and WLAN. The basic CRC computation divides the incoming bit-stream by an irreducible polynomial and compares the residue before and after transmission. CRC-8 is a CRC standard that uses an 8-bit polynomial; similarly, CRC-16 and CRC-CCITT use 16-bit polynomials, and CRC-32 uses a 32-bit polynomial. Although CRC-32 is the most common CRC computation, CRC-8 and CRC-16 are also useful. This paper examines implementations for all three CRC lengths. The CRC algorithm involves shifts and bit-wise XOR computations, as shown in Figure 1, where M is the length of the CRC polynomial.

[Figure 1: CRC Algorithm. Flowchart: Residue[M-1:0] = initial value; i = 0; Residue = {data[i], Residue[M-1:1]}; if Residue[M-1] = 1, Residue = Residue XOR Polynomial; i++; repeat while i < total data length.]
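A bit-serial software version of this loop, in one common MSB-first formulation (the function name and the CRC-CCITT parameters used below are our illustrative choices, not taken from the paper), might look like:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC, MSB-first: shift one bit position at a time and XOR
   the generator polynomial in whenever the residue's top bit is set,
   mirroring the Figure 1 loop. `poly` omits the implicit x^M term. */
uint16_t crc16_bitwise(const uint8_t *msg, size_t len,
                       uint16_t poly, uint16_t init)
{
    uint16_t residue = init;
    for (size_t i = 0; i < len; i++) {
        residue ^= (uint16_t)((uint16_t)msg[i] << 8);  /* fold next byte in */
        for (int b = 0; b < 8; b++) {                  /* one step per bit  */
            if (residue & 0x8000u)
                residue = (uint16_t)((residue << 1) ^ poly);
            else
                residue = (uint16_t)(residue << 1);
        }
    }
    return residue;
}
```

With the CRC-CCITT polynomial 0x1021 and initial value 0xFFFF, the residue of the ASCII string "123456789" is the well-known check value 0x29B1.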
The initial value is set to either 0 or an M-bit string of ones. The original CRC hardware implementation shifts one bit of data at a time into a linear feedback shift register (LFSR), so a CRC check on a message that is N bits long requires executing a set of operations N times [8]. Implementing an LFSR in software is extremely inefficient, as it requires a series of operations on individual bits [12]. To improve performance, CRC calculations are often performed by shifting in 8 bits at a time and using a look-up table with all 256 possible CRC products [1]. Although this significantly speeds up the computation, the algorithm is still serial and thus does not utilize any available parallel hardware. There is significant literature on various parallel CRC algorithms [5, 10, 11]. In [2], z-transforms from digital filter theory are used to parallelize the CRC computation. Galois field implementations are used in [15] with lookahead techniques for parallel CRC computation, and they are used again in [7, 3] without lookahead techniques. Although algorithms have been developed to better perform CRC calculations in software, conventional instruction set architectures (ISAs) are still unsuited to operations on individual bits of data. It is therefore desirable to use instruction set extensions to further improve the performance of CRC computations. In this paper, we develop ISA extensions and corresponding hardware designs to implement two different CRC algorithms on the Sandblaster multithreaded processor, and we examine the corresponding hardware area and worst-case delay, as well as the overall speedup. The Sandblaster processor is designed to support efficient execution of wireless communication and multimedia applications. In high-bandwidth mobile communication systems, standards like WCDMA, WLAN, and WiMAX must execute quickly and efficiently.
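The byte-wise table method of [1] can be sketched as follows; we use the reflected form of the CRC-32 polynomial 0x04C11DB7 that most software implementations adopt, and the function names are our own:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Precompute the 256-entry table: the CRC contribution of every
   possible byte, using the reflected CRC-32 polynomial
   (0x04C11DB7 bit-reversed to 0xEDB88320). */
static void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_table[n] = c;
    }
}

/* Byte-wise CRC-32: one table look-up per 8 bits of input. */
uint32_t crc32_bytewise(const uint8_t *msg, size_t len)
{
    uint32_t residue = 0xFFFFFFFFu;            /* initial value: all ones */
    for (size_t i = 0; i < len; i++)
        residue = crc_table[(residue ^ msg[i]) & 0xFFu] ^ (residue >> 8);
    return residue ^ 0xFFFFFFFFu;              /* standard final inversion */
}
```

For the string "123456789", crc32_bytewise returns the standard CRC-32 check value 0xCBF43926.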
Since the CRC-32 standard, in particular, is a part of these increasingly important communication standards, developing instruction set extensions to improve the performance of CRC is an important task. The paper is organized as follows: Section 2 provides an overview of the Sandblaster architecture for which the ISA extensions are developed. Section 3 discusses the CRC algorithms, instruction set extensions, and hardware designs. Section 4 gives experimental results, including overall CRC computation speedups and hardware area and worst-case delay. Section 5 provides a summary of the work and suggests a preferred solution for CRC computation using instruction set extensions for the Sandblaster architecture.

2. SANDBLASTER PROCESSOR
The Sandblaster processor is designed for embedded mobile communication and multimedia systems, with features including compound instructions, SIMD vector operations, and hardware support for multiple threads. It uses token-triggered threading and has three units that operate in parallel: an instruction fetch and branch unit, an integer and load/store unit, and a SIMD vector unit. Our instruction set extensions are implemented in both the integer unit and the vector unit. The integer unit takes up to two 32-bit operands and outputs one 32-bit operand. The vector unit takes up to three operands, where each operand corresponds to a 4-element vector, and outputs one 4-element vector. Figure 2 shows a block diagram of these units and the Sandblaster memory subsystem [6, 13, 9].

[Figure 2: Sandblaster Processor. Block diagram: instruction fetch and branch unit; integer and load/store unit; SIMD vector unit; I-Cache (64 KB, 64 B lines, 4-way, 2 active); I-Decode; bus/memory interface; data memory (64 KB, 8 banks).]

The three execution units can be utilized in parallel with the Sandblaster 64-bit compound instruction format. Figure 3 shows a single compound instruction. The operation lvu loads vector register vr0 with four 16-bit elements and updates r3, the address pointer. Concurrently, vmulreds squares the contents of vr0, performs saturating addition with the current accumulator ac0, and puts the result back in ac0.

[Figure 3: A 64-bit Compound Instruction. L0: lvu %vr0, %r3, 8 | vmulreds %ac0, %vr0, %vr0, %ac0 | loop %lc0, L0]
In the branch unit, the loop instruction decrements a counter and branches back to L0 if the loop count is not zero. The Sandblaster processor uses a unique form of interleaved multithreading, called Token Triggered Threading (T3), which is illustrated in Figure 4. With T3, all threads can be simultaneously executing instructions, but only one thread may issue an instruction on a cycle boundary [9]. This constraint is also imposed on round-robin threading. What distinguishes T3 is that on each clock cycle a token indicates the subsequent thread that is to issue an instruction. Thread ordering may be sequential (e.g., round robin), even/odd, or based on other communication patterns. Compared to Simultaneous Multithreading (SMT) [4], T3 has much less hardware complexity and power dissipation, since the method for selecting threads is simplified, only a single compound instruction issues each clock cycle, and dependency-checking hardware is eliminated. The current implementation of the Sandblaster processor supports up to eight simultaneous threads of execution per processor core.

[Figure 4: Token Triggered Threading. Example issue order across the eight threads: T0, T7, T2, T5, T4, T3, T6, T1.]

The SIMD vector processing unit (VPU) has four vector processing elements (VPEs). They execute arithmetic and logic operations on 16-bit, 32-bit, or 40-bit fixed-point vector elements in SIMD fashion. The VPU architecture also contains an accumulator register file, a reduction unit, and a shuffle unit, as shown in Figure 5. High-speed 64-bit data busses allow four 16-bit loads or stores each cycle. For 32-bit operands, load-vector-upper (lvu) and load-vector-lower (lvl) instructions are used to load the data into the VPU in two consecutive thread cycles. Most of the Sandblaster operations have eight pipeline stages, but this latency is hidden by the eight cycles between consecutive instructions in a single thread.
This eight-cycle latency provides up to four execution stages to perform our instruction set extension calculations, so our extensions can have fairly high latencies and complexities. We present two sets of instruction set extensions; one set is implemented as operations in the integer unit, while the other is implemented in the vector unit. Each of the operations can be included in a compound instruction. For example, our vector CRC operation can replace the vmulreds operation in the compound instruction just described.

3. INSTRUCTION SET EXTENSIONS AND HARDWARE DESIGNS
The instruction set extensions are designed to fit within the constraints of the Sandblaster processor architecture. However, the operation designed for the integer unit could be implemented on most processors, and the operation designed for the vector unit could be used in most processors with a SIMD-type architecture. To examine the performance benefits of our ISA extensions, the basic CRC algorithm and a Galois field CRC algorithm [7] are written in C code. We profile these algorithms to find the compute-intensive portions of the code. Those portions are then replaced with new operations, and we design hardware to perform that portion of the code. The compute-intensive portions of code are added to the Sandbridge compiler and simulator as intrinsics. The compiler then treats the new intrinsics as any other operation when scheduling and optimizing the code. The Sandbridge simulator is used to generate cycle counts for the code before and after adding the operations. Section 3.1 details this process for the integer unit, and Section 3.2 discusses the vector unit process.

3.1 Integer Unit
[Figure 6: CRC Polynomial Reduction Algorithm and Intrinsic Optimization. Left (original table-lookup loop): Residue = 0; read 8 bits into data; index = (Residue XOR data) & 0xff; tresidue = lookup(index); Residue = (tresidue XOR Residue) >> 8; repeat while more data. Right (with intrinsic): Residue = 0; read 8 bits into data; reduction(residue, Residue, data); repeat while more data.]

Figure 6 shows the optimization of the CRC algorithm using the integer unit. On the left, we show the original algorithm as implemented in C code. This is the basic algorithm using an 8-bit table lookup discussed in Section 1 [1]. The shaded boxes are the compute-intensive code segments that we replace with an intrinsic. The new algorithm, including the reduction intrinsic, is shown on the right. The intrinsic format is reduction(outputresidue, inputresidue, inputdata). We implement this strategy for CRC-8, CRC-16, and CRC-32 for data chunks of 8, 16, and 32 bits. The hardware corresponding to a 4-bit CRC unit that processes two bits of data (d1 and d0 in Figure 7) is shown in Figure 7. The old residue (denoted by r) and the incoming data (denoted by d) are inputs to the hardware, and the new residue is the output. The polynomial is programmable and is stored in a special-purpose register.

[Figure 5: SIMD Vector Processing Unit. Four VPEs (VPE0-VPE3) with load/store data paths, a shuffle unit, a reduction unit, and an accumulator register file.]

3.2 Vector Unit
The vector unit performs a single operation in parallel on four sets of input operands, so we implement a parallel algorithm that uses Galois field operations and is presented in [7]. Galois field arithmetic can be performed efficiently using specialized hardware and can be used to implement other algorithms, such as Reed-Solomon coding [14]. The algorithm uses operations over a Galois field of size 2^M to parallelize the CRC computation. See [7] for the complete derivation. This field is denoted GF(2^M), where M is the number of bits in each operand.
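Galois field multiplication as used here, a carry-less multiply to a (2M-1)-bit product followed by reduction modulo the generator polynomial, can be sketched for M = 8 (the function names are ours; the test uses the AES field polynomial 0x11B purely as a well-known example, not a polynomial from the paper):

```c
#include <stdint.h>

/* Carry-less multiply in GF(2^8): partial products are formed with
   AND and summed with XOR, giving a 15-bit ((2M-1)-bit) product. */
uint16_t gf_clmul8(uint8_t a, uint8_t b)
{
    uint16_t p = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            p ^= (uint16_t)(a << i);
    return p;
}

/* Reduce a 15-bit product modulo `poly` (which includes the x^8
   term, e.g. 0x11B) down to an 8-bit field element. */
uint8_t gf_reduce8(uint16_t p, uint16_t poly)
{
    for (int i = 14; i >= 8; i--)         /* cancel high bits one by one */
        if ((p >> i) & 1u)
            p ^= (uint16_t)(poly << (i - 8));
    return (uint8_t)p;
}

/* Full GF(2^8) multiply: the two stages above composed. */
uint8_t gf_mul8(uint8_t a, uint8_t b, uint16_t poly)
{
    return gf_reduce8(gf_clmul8(a, b), poly);
}
```

In the AES field, 0x53 and 0xCA are multiplicative inverses, so gf_mul8(0x53, 0xCA, 0x11B) returns 0x01.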
There are three stages in the implementation: loading the data and CRC polynomial, pre-computing the β factors, and
[Figure 7: CRC Polynomial Reduction Unit. Inputs: old residue r3 r2 r1 r0, data bits d1 d0, and programmable polynomial p3 p2 p1 p0; an AND/XOR network produces the new residue r3 r2 r1 r0.]

then performing Galois field multiplication and addition on the data and β factors. These stages are shown on the left side of Figure 8. In GF(2^M), addition is equivalent to the bit-wise XOR of two numbers. Multiplication is a series of XORs and shifts: multiplying two M-bit numbers produces a (2M-1)-bit result, which is then divided by an irreducible polynomial to produce an M-bit reduced product. In our case, the CRC generator polynomial is also the Galois reduction polynomial. Loading the data is a simple process; the N-bit message is split into M-bit pieces and stored in program memory. The CRC polynomial is stored so that it can be used as the Galois reduction polynomial. The β factors depend only on the CRC polynomial and its degree. They can be computed once and, if the CRC generator polynomial remains constant, reused for many different sets of data. So, since most systems use CRC-32 with the polynomial 0x04C11DB7, it is possible to generate the β factors once for many CRC calculations. For this reason, we show performance numbers both with and without β generation in Section 4. The β factors are generated by repeatedly multiplying the CRC generator polynomial by itself using Galois field multiplication. After the β factors are generated and the message loaded, all that remains is to properly multiply the β factors by the data chunks and accumulate the result, which is essentially a dot product and is shown in Figure 9. All of these multiplies and additions can occur in parallel, or they can occur in series with an accumulator keeping a running total of the additions. Galois field multiplication is not a trivial operation in software, so we implement it in hardware using our instruction set extensions. We introduce two instructions: vgfmul (vector multiply) and vgfmac (vector multiply-accumulate).
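The multiply-accumulate step and the resulting dot product can be modeled per lane in C over GF(2^8); the function names are ours, gf_mul8 repeats the multiply-and-reduce sketch so this fragment stands alone, and the test again borrows the AES polynomial 0x11B as a familiar example:

```c
#include <stdint.h>

/* GF(2^8) multiply: carry-less multiply, then reduce by `poly`
   (which includes the x^8 term), as described in Section 3.2. */
static uint8_t gf_mul8(uint8_t a, uint8_t b, uint16_t poly)
{
    uint16_t p = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            p ^= (uint16_t)(a << i);      /* partial products via XOR */
    for (int i = 14; i >= 8; i--)
        if ((p >> i) & 1u)
            p ^= (uint16_t)(poly << (i - 8));  /* polynomial reduction */
    return (uint8_t)p;
}

/* One vgfmac-style step: fold d*beta into acc; GF addition is XOR. */
static uint8_t gfmac8(uint8_t acc, uint8_t d, uint8_t beta, uint16_t poly)
{
    return (uint8_t)(acc ^ gf_mul8(d, beta, poly));
}

/* Figure 9 dot product: the sum of D_j * beta_j over GF(2^8). */
uint8_t gf_dot8(const uint8_t *d, const uint8_t *beta, int n, uint16_t poly)
{
    uint8_t acc = 0;
    for (int j = 0; j < n; j++)
        acc = gfmac8(acc, d[j], beta[j], poly);
    return acc;
}
```

In the hardware, the four VPE lanes each perform one such multiply-accumulate step per vgfmac issue, so the loop above advances four elements per instruction.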
The intrinsic operand formats are vgfmul(result, multiplicand, multiplier) and vgfmac(result, multiplicand, multiplier, accumulator). Figure 8 shows the implementation with and without the intrinsics. On the left side, + and * correspond to Galois field addition and multiplication. The intrinsics operate on vectors with four elements each; the indices in the algorithm on the right refer to the first of those four elements, and the next three elements are automatically referenced as well. The β generation process is optimized by generating the first four β values serially, and then repeatedly calling vgfmul to multiply these values by β3 = β0^4.

[Figure 8: Galois Field CRC Algorithm and Intrinsic Optimization. Both sides: set the CRC polynomial and split the data into n chunks of M bits each, called Dj; β0 = Polynomial; β1 = β0*β0, β2 = β1*β0, β3 = β2*β0. Left (serial): βi = β(i-1)*β0, i++, while i < n; then the dot product Xj = Dj*βj, Aj = A(j-1) + Xj, j++, while j < n. Right (with intrinsics): vgfmul(βi, β(i-4), β3), i = i+4, while i < n; then vgfmac(Aj, Dj, βj, A(j-1)), j = j+4, while j < n.]

The dot product shown in Figure 9 is implemented using vgfmac. There are a total of n = N/M multiplies which must occur, and after each multiply there is an addition, so we use the vgfmac instruction here. The accumulation function is not a critical component, as accumulation is a simple XOR operation that is easily performed in software, but performance is improved by implementing it in hardware with the multiplier. As mentioned in Section 2, each Sandbridge VPE can process 16-, 32-, and 40-bit data types. However, the default load operation for each VPE is 16 bits. Therefore, when we use 32-bit data, we use two load operations, load vector upper (lvu) and load vector lower (lvl). The compiler is modified to automatically include these special load instructions for instruction set extensions to the vector unit with 32-bit data types.

[Figure 9: Galois Field Dot Product Used in Vector Computation. The products D(n-1)*β(n-1), ..., D1*β1, D0*β0 are formed in parallel (MULT) and combined by Galois field addition (ADD) into the final CRC product.]

The vgfmul and vgfmac hardware is shown in Figure 10 for M-bit operands. The preliminary multiplier is implemented using a parallel array of M^2 AND gates, which generates the partial products of D and β, followed by a tree of M^2 XOR gates, which sums the partial products using Galois field addition. The polynomial Galois field reduction unit is similar to that in Figure 7 and uses M(M-1) AND gates and M(M-1) XOR gates. The MAC signal selects between 0 and the accumulator at the end of the operation, allowing the same hardware to be reused for both the vgfmul and vgfmac operations.

[Figure 10: Galois Field Multiply-Accumulate Unit. M-bit operands Di and βi feed an M-bit preliminary multiply producing a (2M-1)-bit product; an M-bit Galois field polynomial reduction (driven by the polynomial register) and M-bit Galois field addition follow; a 2-to-1 mux controlled by the MAC signal selects between 0 and the accumulator Acc.]

4. EXPERIMENTAL RESULTS
Each instruction was simulated using C code and added to the Sandbridge compiler and simulator as an intrinsic instruction that takes a single thread cycle to execute. The Sandbridge compiler transforms the intrinsics into an intermediate representation that is optimized and scheduled along with the rest of the code, which lets the new operations be included in compound instructions and undergo the same optimizations as other operations. Compiler optimizations include vectorization, loop unrolling, software pipelining, code motion, function inlining, and peephole optimizations [13]. We implemented our baseline CRC algorithm and all optimized CRC algorithms using our intrinsics in C code and simulated them using a data set with 500, bit values.

Table 1: Speed-up Over Base ISA Using Integer Unit
  Function   8 bits/cycle   16 bits/cycle   32 bits/cycle
  CRC-8      2.9x           4.6x            23.0x
  CRC-16     2.9x           4.6x            23.0x
  CRC-32     2.9x           4.6x            23.0x

Table 2: Integer Unit Hardware Accelerator Characteristics
  Function       8 bits/cycle   16 bits/cycle   32 bits/cycle
  Size (µm²)     1,836          8,462           30,078
  Latency (ns)
Our 8-bit table-lookup implementation was approximately 7 times faster than our bit-by-bit CRC implementation. We chose the 8-bit table-lookup implementation as our baseline for all speedup calculations, since it is a standard software algorithm. Hardware was designed in Verilog and implemented using the gflxp 0.11-micron CMOS standard cell library and Synopsys Design Compiler. As shown in Table 1, the speedups achieved in the integer unit are directly proportional to the number of bits of data processed each cycle. Although the size of the arithmetic operations changes with the CRC polynomial length, the speedup is constant when compared to a table-lookup algorithm, since in all cases the table fits in memory and our accelerator's performance constraint is the amount of data loaded, not the polynomial length. Hardware area and worst-case delay are shown in Table 2. Both increase as the number of bits processed per instruction increases, so the tradeoff between hardware cost and potential acceleration has to be taken into consideration when choosing the proper hardware accelerator size. Table 3 shows speedups using the Galois field CRC algorithm both when β factor generation is included and when
it is not included in the computation time.

Table 3: Speed-up Over Base ISA Using Vector Unit
  Function               Including β Generation   Not Including β Generation
  CRC-8 on 8-bit MAC     3.8x                     7.7x
  CRC-16 on 16-bit MAC   7.7x                     15.3x
  CRC-32 on 32-bit MAC   4.8x                     7.7x

Table 4: Vector Unit Hardware Accelerator Characteristics
  Function         CRC-8    CRC-16   CRC-32
  Area (µm²/VPE)   3,963    18,881   86,111
  Latency (ns)

The CRC-8 algorithm is implemented with 8-bit operations, so it processes 8 bits of data each time the vgfmac instruction is called. The CRC-16 and CRC-32 algorithms are scaled similarly. The CRC-32 implementation processes more bits per vgfmac operation than the other two implementations, but it is slowed by the increased number of loads that it must perform. Since most computation has been moved to an intrinsic operation, the majority of operations in the assembly code are loads and stores. As discussed in Section 2, the vector unit requires two load instructions to load a vector with 32-bit elements. So, the CRC-16 and CRC-32 implementations have the same total number of loads and stores, even though the CRC-32 implementation calls the vgfmac instruction half as often. Additionally, the loads and stores in the CRC-32 implementation happen in more concentrated bursts, and the compiler creates less-optimized compound instructions than in the CRC-16 implementation. The hardware costs for the vector unit operations are shown in Table 4; the area increases quadratically as the operand size increases.

5. SUMMARY
This paper gave an overview of the CRC algorithm, discussed the Sandblaster architecture, explained our implementations, and gave the final speedup numbers. We have shown that instruction set extensions can be effectively used to improve the performance of CRC calculations. The hardware added to the integer unit produced a speedup of up to 23.0 times the baseline implementation. In the vector unit, we achieved a speedup of up to 15.3 times.
We are limited by the increased software overhead of the Galois field algorithm and the extra loads of the β factors. Although we did not examine memory use, it is significant to note that the β factors in the Galois field implementation and the look-up table in the baseline implementation must each be stored in memory, while our integer unit implementation does not have any additional memory overhead.

6. ADDITIONAL AUTHORS
John Glossner, Daniel Iancu, Mayan Moudgill, and Sanjay Jinturkar (Sandbridge Technologies, White Plains, NY)

REFERENCES
[1] A. Perez. Byte-wise CRC Calculations. IEEE Micro, pages 40-50, June 1983.
[2] G. Albertengo and R. Sisto. Parallel CRC Generation. IEEE Micro, 10(5):63-71, October 1990.
[3] G. Campobello, G. Patane, and M. Russo. Parallel CRC Realization. IEEE Transactions on Computers, 52(10), October 2003.
[4] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the International Symposium on Computer Architecture, June 1995.
[5] A. Doering and M. Waldvogel. Fast and Flexible CRC Calculation. Electronics Letters, 40(1):10-11, January 2004.
[6] J. Glossner, S. Dorward, S. Jinturkar, M. Moudgill, E. Hokenek, M. Schulte, and S. Vassiliadis. Sandblaster Software Tools. In Proceedings of the Workshop on Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, July 2002.
[7] H. M. Ji and E. Killian. Fast Parallel CRC Algorithm and Implementation on a Configurable Processor. In Proceedings of the IEEE International Conference on Communications, volume 3, 28 April-2 May 2002.
[8] R. Lee. Cyclic Code Redundancy Designers Guide Protects Data. Digital Design, 11(7):77-85, July 1981.
[9] M. Schulte, J. Glossner, S. Mamidi, M. Moudgill, and S. Vassiliadis. A Low-Power Multithreaded Processor for Baseband Communication Systems. Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, Lecture Notes in Computer Science, volume 3133, July 2004.
[10] A. K. Pandeya and T. J. Cassa. Parallel CRC Lets Many Lines Use One Circuit. Computer Design, 14(9):87-91, September 1975.
[11] T.-B. Pei and C. Zukowski. High-Speed Parallel CRC Circuits in VLSI. IEEE Transactions on Communications, 40(4), April 1992.
[12] T. V. Ramabadran and S. S. Gaitonde. A Tutorial on CRC Computations. IEEE Micro, 8(4):62-75, 1988.
[13] S. Jinturkar, J. Glossner, V. Kotlyar, and M. Moudgill. The Sandblaster Automatic Multithreaded Vectorizing Compiler. In Proceedings of the 2004 Global Signal Processing Expo and International Signal Processing Conference, September 2004.
[14] S. Mamidi, M. Schulte, D. Iancu, A. Iancu, and J. Glossner. Instruction Set Extensions for Reed-Solomon Encoding and Decoding. In Proceedings of the IEEE 16th International Conference on Application-Specific Systems, Architectures and Processors, Samos, Greece, 2005.
[15] M.-D. Shieh, M.-H. Sheu, C.-H. Chen, and H.-F. Lo. A Systematic Approach for Parallel CRC Computations. Journal of Information Science and Engineering, 17(3):445-461, May 2001.
More informationData Link Networks. Hardware Building Blocks. Nodes & Links. CS565 Data Link Networks 1
Data Link Networks Hardware Building Blocks Nodes & Links CS565 Data Link Networks 1 PROBLEM: Physically connecting Hosts 5 Issues 4 Technologies Encoding - encoding for physical medium Framing - delineation
More informationHigh Speed Special Function Unit for Graphics Processing Unit
High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum
More informationDesign of Flash Controller for Single Level Cell NAND Flash Memory
Design of Flash Controller for Single Level Cell NAND Flash Memory Ashwin Bijoor 1, Sudharshana 2 P.G Student, Department of Electronics and Communication, NMAMIT, Nitte, Karnataka, India 1 Assistant Professor,
More informationVLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT K.Sandyarani 1 and P. Nirmal Kumar 2 1 Research Scholar, Department of ECE, Sathyabama
More informationSome portions courtesy Robin Kravets and Steve Lumetta
CSE 123 Computer Networks Fall 2009 Lecture 4: Data-Link I: Framing and Errors Some portions courtesy Robin Kravets and Steve Lumetta Administrative updates I m Im out all next week no lectures, but You
More informationMicroprocessor Extensions for Wireless Communications
Microprocessor Extensions for Wireless Communications Sridhar Rajagopal and Joseph R. Cavallaro DRAFT REPORT Rice University Center for Multimedia Communication Department of Electrical and Computer Engineering
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationHigh-Performance Cryptography in Software
High-Performance Cryptography in Software Peter Schwabe Research Center for Information Technology Innovation Academia Sinica September 3, 2012 ECRYPT Summer School: Challenges in Security Engineering
More information2.4 Error Detection Bit errors in a frame will occur. How do we detect (and then. (or both) frames contains an error. This is inefficient (and not
CS475 Networks Lecture 5 Chapter 2: Direct Link Networks Assignments Reading for Lecture 6: Sections 2.6 2.8 Homework 2: 2.1, 2.4, 2.6, 2.14, 2.18, 2.31, 2.35. Due Thursday, Sept. 15 2.4 Error Detection
More informationCS 101, Mock Computer Architecture
CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations
More informationA Brief Description of the NMP ISA and Benchmarks
Report No. UIUCDCS-R-2005-2633 UILU-ENG-2005-1823 A Brief Description of the NMP ISA and Benchmarks by Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine February 2005 A Brief Description
More informationCRC Generation for Protocol Processing
CRC Generation for Protocol Processing Ulf Nordqvist, Tomas Henrikson and Dake Liu Department of Physics and Measurement Technology Linköpings University, SE 58183 Linköping, Sweden Phone: +46-1328-{8916,
More informationTowards a Java-enabled 2Mbps wireless handheld device
Towards a Java-enabled 2Mbps wireless handheld device John Glossner 1, Michael Schulte 2, and Stamatis Vassiliadis 3 1 Sandbridge Technologies, White Plains, NY 2 Lehigh University, Bethlehem, PA 3 Delft
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationA Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications
A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications Metin Mete Özbilen 1 and Mustafa Gök 2 1 Mersin University, Engineering Faculty, Department of Computer Science,
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationANALYSIS OF AN AREA EFFICIENT VLSI ARCHITECTURE FOR FLOATING POINT MULTIPLIER AND GALOIS FIELD MULTIPLIER*
IJVD: 3(1), 2012, pp. 21-26 ANALYSIS OF AN AREA EFFICIENT VLSI ARCHITECTURE FOR FLOATING POINT MULTIPLIER AND GALOIS FIELD MULTIPLIER* Anbuselvi M. and Salivahanan S. Department of Electronics and Communication
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationChapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction
More informationsignature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1
CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS
ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS 1.Define Computer Architecture Computer Architecture Is Defined As The Functional Operation Of The Individual H/W Unit In A Computer System And The
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationVIII. DSP Processors. Digital Signal Processing 8 December 24, 2009
Digital Signal Processing 8 December 24, 2009 VIII. DSP Processors 2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator (MAC), Modified bus structures and memory access
More informationEC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution
EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution Important guidelines: Always state your assumptions and clearly explain your answers. Please upload your solution document
More informationPartitioned Branch Condition Resolution Logic
1 Synopsys Inc. Synopsys Module Compiler Group 700 Middlefield Road, Mountain View CA 94043-4033 (650) 584-5689 (650) 584-1227 FAX aamirf@synopsys.com http://aamir.homepage.com Partitioned Branch Condition
More informationAdvanced Caching Techniques (2) Department of Electrical Engineering Stanford University
Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15
More informationWilliam Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ William Stallings Computer Organization and Architecture 10 th Edition 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. 2 + Chapter 3 A Top-Level View of Computer Function and Interconnection
More informationCOSC 243. Computer Architecture 1. COSC 243 (Computer Architecture) Lecture 6 - Computer Architecture 1 1
COSC 243 Computer Architecture 1 COSC 243 (Computer Architecture) Lecture 6 - Computer Architecture 1 1 Overview Last Lecture Flip flops This Lecture Computers Next Lecture Instruction sets and addressing
More informationEECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC
EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC April 12, 2012 John Wawrzynek Spring 2012 EECS150 - Lec24-hdl3 Page 1 Parallelism Parallelism is the act of doing more than one thing
More informationOne-Level Cache Memory Design for Scalable SMT Architectures
One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract
More informationISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies
VLSI IMPLEMENTATION OF HIGH PERFORMANCE DISTRIBUTED ARITHMETIC (DA) BASED ADAPTIVE FILTER WITH FAST CONVERGENCE FACTOR G. PARTHIBAN 1, P.SATHIYA 2 PG Student, VLSI Design, Department of ECE, Surya Group
More informationReminder: tutorials start next week!
Previous lecture recap! Metrics of computer architecture! Fundamental ways of improving performance: parallelism, locality, focus on the common case! Amdahl s Law: speedup proportional only to the affected
More informationVLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier
VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier U.V.N.S.Suhitha Student Department of ECE, BVC College of Engineering, AP, India. Abstract: The ever growing need for improved
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationVector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data
Vector Processors A vector processor is a pipelined processor with special instructions designed to keep the (floating point) execution unit pipeline(s) full. These special instructions are vector instructions.
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationECE 486/586. Computer Architecture. Lecture # 7
ECE 486/586 Computer Architecture Lecture # 7 Spring 2015 Portland State University Lecture Topics Instruction Set Principles Instruction Encoding Role of Compilers The MIPS Architecture Reference: Appendix
More informationCOMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital
Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital hardware modules that accomplish a specific information-processing task. Digital systems vary in
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationDesign of a Pipelined 32 Bit MIPS Processor with Floating Point Unit
Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit P Ajith Kumar 1, M Vijaya Lakshmi 2 P.G. Student, Department of Electronics and Communication Engineering, St.Martin s Engineering College,
More informationChapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations
Chapter 4 The Processor Part I Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationPipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications
, Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationMPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors
MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationPrinciples in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008
Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.
More informationData Speculation. Architecture. Carnegie Mellon School of Computer Science
Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A
More informationLaboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication
Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationLecture 25: Busses. A Typical Computer Organization
S 09 L25-1 18-447 Lecture 25: Busses James C. Hoe Dept of ECE, CMU April 27, 2009 Announcements: Project 4 due this week (no late check off) HW 4 due today Handouts: Practice Final Solutions A Typical
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationAiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.
2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationCS Computer Architecture
CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 Computer Systems Organization The CPU (Central Processing Unit) is the brain of the computer. Fetches instructions from main memory.
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationECE 341 Final Exam Solution
ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.
More informationHigh Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC
Journal of Computational Information Systems 7: 8 (2011) 2843-2850 Available at http://www.jofcis.com High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Meihua GU 1,2, Ningmei
More informationScreaming Fast Galois Field Arithmetic Using Intel SIMD Instructions
Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions Ethan L. Miller Center for Research in Storage Systems University of California, Santa Cruz (and Pure Storage) Authors Jim Plank Univ.
More informationM A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
M A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE 6.111 Introductory Digital Systems Laboratory Fall 2017 Lecture PSet #6 of
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More information
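The shift-and-XOR loop of Figure 1 can be sketched in software as follows. This is a minimal Python sketch of the common MSB-first bit-serial formulation (Figure 1 itself shows the reflected, LSB-first variant), using the 16-bit CRC-CCITT polynomial 0x1021; the function name and parameter defaults are illustrative, not from the paper:

```python
def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bit-serial CRC-16, MSB-first: one message bit is shifted into the
    residue per iteration, and the residue is XORed with the polynomial
    whenever the bit shifted out of the high end is 1."""
    residue = init
    for byte in data:
        for bit in range(7, -1, -1):          # process each message bit
            msb = (residue >> 15) ^ ((byte >> bit) & 1)
            residue = (residue << 1) & 0xFFFF  # shift residue left by one
            if msb:
                residue ^= poly                # conditional XOR with polynomial
    return residue
```

The inner loop's single-bit shifts and conditional XORs are exactly the operations that make a conventional processor inefficient here, and that the paper's instruction set extensions target.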