Instruction Set Extensions for Cyclic Redundancy Check on a Multithreaded Processor


Emily R. Blem, Dept. of ECE, University of Wisconsin-Madison, Madison, WI, blem@cae.wisc.edu
Suman Mamidi, Dept. of ECE, University of Wisconsin-Madison, Madison, WI, mamidi@cae.wisc.edu
Michael J. Schulte, Dept. of ECE, University of Wisconsin-Madison, Madison, WI, schulte@engr.wisc.edu

ABSTRACT
Cyclic redundancy check (CRC) algorithms are widely used for error detection in wireless communication systems. CRC is a simple algorithm, but implementations on conventional processors are inefficient because the CRC algorithm is serial and based on bit-wise operations. In this paper, we explore several instruction set extensions to the Sandbridge multithreaded processor for CRC. The performance speedup of each extension is evaluated using the Sandbridge software tools, and the area and delay of the corresponding hardware is presented. The instruction set extensions produce performance gains of up to 23.0x for the CRC kernel.

1. INTRODUCTION
Cyclic redundancy check (CRC) computations are used in a variety of applications, especially when information transmission or reception is involved. Wireless communication standards that use CRC include Bluetooth, WiMAX, WCDMA, and WLAN. The basic CRC computation divides the incoming bit-stream by an irreducible polynomial and compares the residue before and after transmission. CRC-8 uses an 8-bit polynomial, CRC-16 and CRC-CCITT use 16-bit polynomials, and CRC-32 uses a 32-bit polynomial. Although CRC-32 is the most common CRC computation, CRC-8 and CRC-16 are also useful. This paper examines implementations for all three CRC lengths. The CRC algorithm involves shifts and bit-wise XOR computations, as shown in Figure 1, where M is the length of the CRC polynomial. The initial value is set to either 0 or an M-bit string of ones.

Figure 1: CRC Algorithm (bit-serial flowchart: the M-bit residue is set to the initial value, each data bit is shifted into the residue, and the residue is XORed with the polynomial whenever the tested bit is 1, until all data bits have been processed)

The original CRC hardware implementation shifts one bit of data at a time into a linear feedback shift register (LFSR), so a CRC check on a message that is N bits long requires the execution of a set of operations N times [8]. Implementing an LFSR in software is extremely inefficient, as it requires a series of operations on individual bits [12]. To improve performance, CRC calculations are often performed by shifting in 8 bits at a time and using a look-up table with all 256 possible CRC products [1]. Although this significantly speeds up the computation, the algorithm is still serial and thus does not utilize any available parallel hardware. There is significant literature on various parallel CRC algorithms [5, 10, 11]. In [2], z-transforms from digital filter theory are used to parallelize the CRC computation. Galois field implementations are used in [15] with lookahead techniques for parallel CRC computation, and they are used again in [7, 3] without lookahead techniques. Although algorithms have been developed to better perform CRC calculations in software, conventional instruction set architectures (ISAs) are still unsuited to operations on individual bits of data. It is therefore desirable to use instruction set extensions to further improve the performance of CRC computations.
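The bit-serial and byte-wise table-lookup baselines described above can be summarized with a minimal C sketch. It assumes the common reflected CRC-32 conventions (polynomial 0xEDB88320, all-ones initial value, final inversion); these conventions are standard practice rather than details taken from this paper.

/*
 * Minimal sketch of the two software baselines: a bit-serial CRC (one
 * shift/XOR step per message bit, as in Figure 1) and the byte-wise CRC
 * using a 256-entry look-up table [1].
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32_table[256];

/* Build the 256-entry table: one bit-serial reduction per possible byte. */
static void crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i;
        for (int b = 0; b < 8; b++)
            r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
        crc32_table[i] = r;
    }
}

/* Bit-serial CRC: one shift/XOR step per message bit. */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t r = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        r ^= data[i];
        for (int b = 0; b < 8; b++)
            r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
    }
    return ~r;
}

/* Table-lookup CRC: one table access per message byte. */
static uint32_t crc32_table_lookup(const uint8_t *data, size_t len)
{
    uint32_t r = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        r = crc32_table[(r ^ data[i]) & 0xFF] ^ (r >> 8);
    return ~r;
}

int main(void)
{
    const char *msg = "123456789";
    crc32_init_table();
    printf("bitwise: %08x\n", crc32_bitwise((const uint8_t *)msg, strlen(msg)));
    printf("table:   %08x\n", crc32_table_lookup((const uint8_t *)msg, strlen(msg)));
    return 0;   /* both should print cbf43926 for this standard test vector */
}

Both routines visit the message strictly in order, which is why neither form exposes parallelism to the hardware.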
In this paper, we develop ISA extensions and corresponding hardware designs to implement two different CRC algorithms on the Sandblaster multithreaded processor, and examine the corresponding hardware area and worst case delay, as well as the overall speedup. The Sandblaster processor is designed to support efficient execution of wireless communication and multimedia applications. In high-bandwidth mobile communication systems, standards like WCDMA, WLAN, and WiMAX must execute quickly and efficiently. Since the CRC-32 standard, in particular, is a part of these increasingly important communication standards, developing instruction set extensions to improve the performance of CRC is an important task. The paper is organized as follows: Section 2 provides an overview of the Sandblaster architecture for which the ISA extensions are developed. Section 3 discusses the CRC algorithms, instruction set extensions, and hardware designs.

Section 4 gives experimental results, including overall CRC computation speedups and hardware area and worst case delay. Section 5 provides a summary of the work and suggests a preferred solution for CRC computation using instruction set extensions for the Sandblaster architecture.

Figure 2: Sandblaster Processor (64KB instruction cache with 64B lines, 4-way (2-active); 64KB, 8-bank data memory; bus/memory interface; instruction fetch and branch unit; integer and load/store unit; SIMD vector unit)

Figure 3: A 64-bit Compound Instruction (L0: lvu %vr0, %r3, 8 | vmulreds %ac0, %vr0, %vr0, %ac0 | loop %lc0, L0)

2. SANDBLASTER PROCESSOR
The Sandblaster processor is designed for embedded mobile communication and multimedia systems, with features including compound instructions, SIMD vector operations, and hardware support for multiple threads. It uses token-triggered threading and has three units that operate in parallel: an instruction fetch and branch unit, an integer and load/store unit, and a SIMD vector unit. Our instruction set extensions are implemented in both the integer unit and the vector unit. The integer unit takes up to two 32-bit operands and outputs one 32-bit operand. The vector unit takes up to three operands, where each operand corresponds to a 4-element vector, and outputs one 4-element vector. Figure 2 shows a block diagram of these units and the Sandblaster memory subsystem [6, 13, 9]. The three execution units can be utilized in parallel with the Sandblaster 64-bit compound instruction format. Figure 3 shows a single compound instruction. The operation lvu loads vector register vr0 with four 16-bit elements and updates r3, the address pointer. Concurrently, vmulreds squares the contents of vr0, performs saturating addition with the current accumulator ac0, and puts the result back in ac0. In the branch unit, the loop instruction decrements a counter and branches back to L0 if the loop count is not zero. The Sandblaster processor uses a unique form of interleaved multithreading, called Token Triggered Threading (T3), which is illustrated in Figure 4. With T3, all threads can be simultaneously executing instructions, but only one thread may issue an instruction on a cycle boundary [9]. This constraint is also imposed on round-robin threading. What distinguishes T3 is that each clock cycle a token indicates the subsequent thread that is to issue an instruction. Thread ordering may be sequential (e.g., round robin), even/odd, or based on other communication patterns. Compared to Simultaneous Multithreading (SMT) [4], T3 has much less hardware complexity and power dissipation, since the method for selecting threads is simplified, only a single compound instruction issues each clock cycle, and dependency checking hardware is eliminated. The current implementation of the Sandblaster processor supports up to eight simultaneous threads of execution per processor core.

Figure 4: Token Triggered Threading (threads T0 through T7 issue one compound instruction per cycle in token order)

The SIMD vector processing unit (VPU) has four vector processing elements (VPEs). They execute arithmetic and logic operations on 16-bit, 32-bit, or 40-bit fixed-point vector elements in SIMD fashion. The VPU architecture also contains an accumulator register file, a reduction unit, and a shuffle unit, as shown in Figure 5. High-speed 64-bit data busses allow four 16-bit loads or stores each cycle.
For 32-bit operands, load-vector-upper (lvu) and load-vector-lower (lvl) instructions are used to load the data into the VPU in two consecutive thread cycles. Most of the Sandblaster operations have eight pipeline stages, but this latency is hidden by the eight cycles between consecutive instructions in a single thread. This eight-cycle latency provides up to four execution stages to perform our instruction set extension calculations, so our extensions can have fairly high latencies and complexities. We present two sets of instruction set extensions; one set is implemented as operations in the integer unit, while the other is implemented in the vector unit. Each of the operations can be included in a compound instruction. For example, our vector CRC operation can replace the vmulreds operation in the compound instruction just described.

Figure 5: SIMD Vector Processing Unit (four VPEs with load/store data busses, a shuffle unit, a reduction unit, and an accumulator register file)

3. INSTRUCTION SET EXTENSIONS AND HARDWARE DESIGNS
The instruction set extensions are designed to fit within the constraints of the Sandblaster processor architecture. However, the operation designed for the integer unit could be implemented on most processors, and the operation designed for the vector unit could be used in most processors with a SIMD-type architecture. To examine the performance benefits of our ISA extensions, the basic CRC algorithm and a Galois field CRC algorithm [7] are written in C-code. We profile these algorithms to find the compute-intensive portions of the code. Those portions are then replaced with new operations, and we design hardware to perform that portion of the code. The compute-intensive portions of code are added to the Sandbridge compiler and simulator as intrinsics. The compiler then treats the new intrinsics as any other operation when scheduling and optimizing the code. The Sandbridge simulator is used to generate cycle counts for the code before and after adding the operations. Section 3.1 details this process for the integer unit, and Section 3.2 discusses the vector unit process.

3.1 Integer Unit

Figure 6: CRC Polynomial Reduction Algorithm and Intrinsic Optimization (flowcharts of the byte-wise table-lookup CRC loop and of the same loop with the lookup and XOR steps replaced by the reduction intrinsic)

Figure 6 shows the optimization of the CRC algorithm using the integer unit. On the left, we show the original algorithm as implemented in C-code. This is the basic algorithm using an 8-bit table lookup discussed in Section 1 [1]. The shaded boxes are the compute-intensive code segments that we replace with an intrinsic. The new algorithm, including the intrinsic reduction, is shown on the right. The intrinsic format is reduction(outputresidue, inputresidue, inputdata). We implement this strategy for CRC-8, CRC-16, and CRC-32 for data chunks of 8, 16, and 32 bits. The hardware for a 4-bit CRC unit that processes M bits of data is shown in Figure 7. The old residue (denoted by r) and the incoming data (denoted by d) are inputs to the hardware, and the new residue is the output. The polynomial is programmable and is stored in a special-purpose register.

Figure 7: CRC Polynomial Reduction Unit (the old residue bits r3..r0 and the data bits d0..d(M-1) pass through cascaded stages of AND and XOR gates gated by the polynomial bits p3..p0 to produce the new residue)
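A minimal C sketch of what the reduction intrinsic is described as computing may make the optimization concrete. It models the 32-bit case and assumes the common reflected CRC-32 conventions (polynomial 0xEDB88320, all-ones initial value, little-endian byte order within each chunk); the paper's reduction(outputresidue, inputresidue, inputdata) intrinsic and its hardware unit are represented here by an ordinary function.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Model of reduction(out, residue, data) for a 32-bit chunk: the residue is
   XORed with the chunk and reduced by 32 shift/XOR steps in one operation,
   replacing the four table lookups the byte-wise loop of Figure 6 (left)
   would otherwise need. */
static uint32_t crc32_reduce32(uint32_t residue, uint32_t data)
{
    uint32_t r = residue ^ data;
    for (int b = 0; b < 32; b++)
        r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
    return r;
}

/* CRC loop as in Figure 6 (right): one intrinsic-style call per 32-bit chunk.
   Chunks are the message bytes packed little-endian. */
static uint32_t crc32_chunked(const uint32_t *chunks, size_t nchunks)
{
    uint32_t residue = 0xFFFFFFFFu;
    for (size_t i = 0; i < nchunks; i++)
        residue = crc32_reduce32(residue, chunks[i]);   /* reduction(...) */
    return ~residue;
}

int main(void)
{
    /* "12345678" packed little-endian into two 32-bit chunks. */
    const uint32_t msg[2] = { 0x34333231u, 0x38373635u };
    printf("crc32 = %08x\n", crc32_chunked(msg, 2));
    return 0;
}

In hardware, the shift/XOR steps inside crc32_reduce32 collapse into the combinational network of Figure 7, so each call costs a single operation rather than a loop.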

3.2 Vector Unit
The vector unit performs a single operation in parallel on four sets of input operands, so we implement a parallel algorithm that uses Galois field operations and is presented in [7]. Galois field arithmetic can be performed efficiently using specialized hardware and can be used to implement other algorithms such as Reed-Solomon coding [14]. The algorithm uses Galois field operations over a Galois field of size 2^M to parallelize the CRC computation. See [7] for the complete derivation. This field is denoted as GF(2^M), where M is the number of bits in each operand. There are three stages in the implementation: loading the data and CRC polynomial, pre-computing the β factors, and then performing Galois field multiplication and addition on the data and β factors. These stages are shown on the left side of Figure 8. In GF(2^M), addition is equivalent to the bitwise XOR of two numbers. Multiplication is a series of XORs and shifts. Multiplying two M-bit numbers produces a (2M-1)-bit result. This result is then divided by an irreducible polynomial to produce an M-bit reduced product. In our case, the CRC generator polynomial is also the Galois reduction polynomial. Loading the data is a simple process; the N-bit message is split into M-bit pieces and stored in program memory. The CRC polynomial is stored so that it can be used as the Galois reduction polynomial. The β factors depend only on the CRC polynomial and its degree. They can be computed once and, if the CRC generator polynomial remains constant, reused for many different sets of data. So, since most systems use CRC-32 with the polynomial 0x04C11DB7, it is possible to generate the β factors once for many CRC calculations. For this reason, we show performance numbers both with and without β generation in Section 4. The β factors are generated by repeatedly multiplying the CRC generator polynomial by itself using Galois field multiplication. After the β factors are generated and the message loaded, all that remains is to properly multiply the β factors by the data chunks and accumulate the result, which is essentially a dot product and is shown in Figure 9. All of these multiplies and additions can occur in parallel, or they can occur in series with an accumulator keeping a running total of the additions. Galois field multiplication is not a trivial operation in software, so we implement it in hardware using our instruction set extensions. We introduce two instructions: vgfmul (vector multiply) and vgfmac (vector multiply-accumulate). The intrinsic operand formats are vgfmul(result, multiplicand, multiplier) and vgfmac(result, multiplicand, multiplier, accumulator). Figure 8 shows the implementation with and without the intrinsics. On the left side, + and * correspond to Galois field addition and multiplication. The intrinsics operate on vectors with four elements each. The indices in the algorithm on the right refer to the first of those four elements, and the next three elements are automatically referenced as well. The β generation process is optimized by generating the first four β values serially, and then repeatedly calling vgfmul to multiply these values by β_3 = β_0^4. The dot product shown in Figure 9 is implemented using vgfmac. There are a total of n = N/M multiplies which must occur, and after each multiply there is an addition, so we use the vgfmac instruction here. The accumulation function is not a critical component, as accumulation is a simple XOR operation that is easily performed in software, but performance is improved by implementing it in hardware with the multiplier. As mentioned in Section 2, each Sandbridge VPE can process 16-, 32-, and 40-bit data types. However, the default load operation for each VPE is 16 bits. Therefore, when we use 32-bit data, we use two load operations, load vector upper (lvu) and load vector lower (lvl). The compiler is modified to automatically include these special load instructions for instruction set extensions to the vector unit with 32-bit data types.
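The following is a scalar C model of what one lane of the vgfmul and vgfmac operations is described as computing: a GF(2^M) multiplication reduced by the CRC generator polynomial, optionally followed by a Galois field (XOR) accumulate. It is a sketch for M = 32 with the generator 0x04C11DB7 mentioned above; the actual instructions operate on four-element vectors and are not reproduced here.

#include <stdint.h>
#include <stdio.h>

/* Carry-less multiply of two 32-bit polynomials over GF(2); up to 63-bit result. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            acc ^= (uint64_t)a << i;
    return acc;
}

/* Reduce the (2M-1)-bit product modulo the degree-32 generator polynomial
   x^32 + ... (0x04C11DB7 holds the low 32 coefficients; the x^32 term is
   implicit), yielding the M-bit reduced product. */
static uint32_t gf_reduce(uint64_t prod)
{
    for (int i = 63; i >= 32; i--)
        if ((prod >> i) & 1)
            prod ^= ((uint64_t)0x04C11DB7u << (i - 32)) | ((uint64_t)1 << i);
    return (uint32_t)prod;
}

/* One lane of the assumed vgfmul / vgfmac behavior. */
static uint32_t gfmul32(uint32_t a, uint32_t b)                { return gf_reduce(clmul32(a, b)); }
static uint32_t gfmac32(uint32_t a, uint32_t b, uint32_t acc)  { return gfmul32(a, b) ^ acc; }

int main(void)
{
    uint32_t b1 = gfmul32(0x04C11DB7u, 0x04C11DB7u);            /* e.g. beta_1 = beta_0 * beta_0 */
    printf("product %08x, with accumulate %08x\n", b1, gfmac32(0x04C11DB7u, 0x04C11DB7u, 0xFFFFFFFFu));
    return 0;
}

Because GF addition is just XOR, folding the accumulate into the multiplier costs only one extra XOR stage, which is why the paper reuses the same datapath for both instructions.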
The vgfmul and vgfmac hardware for M-bit operands is shown in Figure 10. The preliminary multiplier is implemented using a parallel array of M^2 AND gates, which generates the partial products of D and β, followed by a tree of XOR gates, which sums the partial products using Galois field addition. The polynomial Galois field reduction unit is similar to that in Figure 7 and uses M(M-1) AND gates and M(M-1) XOR gates. The MAC signal selects between 0 and the accumulator at the end of the operation, allowing the same hardware to be reused for both the vgfmul and vgfmac operations.

Figure 8: Galois Field CRC Algorithm and Intrinsic Optimization (flowcharts of the Galois field CRC computation without and with the intrinsics: the CRC polynomial is set and the data is split into n chunks of M bits each; β_0 is set from the polynomial and the remaining β factors are generated by Galois field multiplication, with vgfmul producing four factors at a time; the data chunks are then multiplied by the β factors and accumulated as a dot product, with vgfmac computing A_j = D_j * β_j + A_(j-1))
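A scalar C sketch of the loop structure in Figure 8 follows, assuming M = 32 and the CRC-32 generator 0x04C11DB7: β factors are produced by repeated Galois field multiplication, and the data chunks are folded in with a Galois field multiply-accumulate, as the vgfmul/vgfmac intrinsics would do four lanes at a time. It models a non-reflected CRC with a zero initial value; the exact indexing, initial value, and bit-ordering conventions of the paper's implementation are not reproduced here, and gfmul32 repeats the multiply model from the previous sketch so this one stands alone.

#include <stdint.h>
#include <stdio.h>

static uint32_t gfmul32(uint32_t a, uint32_t b)   /* GF(2^32) multiply, reduced by 0x04C11DB7 */
{
    uint64_t prod = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1) prod ^= (uint64_t)a << i;
    for (int i = 63; i >= 32; i--)
        if ((prod >> i) & 1) prod ^= ((uint64_t)0x04C11DB7u << (i - 32)) | ((uint64_t)1 << i);
    return (uint32_t)prod;
}

int main(void)
{
    /* Message split into n chunks of M = 32 bits; chunk[0] holds the first
       (most significant) 32 bits of the message. */
    uint32_t chunk[4] = { 0x0123ABCDu, 0x5A5A0F0Fu, 0xDEADBEEFu, 0x00C0FFEEu };
    int n = 4;

    /* beta_0 = x^32 mod p, i.e. the generator's low 32 bits;
       beta_k = beta_(k-1) * beta_0, as in the left side of Figure 8. */
    uint32_t beta[4];
    beta[0] = 0x04C11DB7u;
    for (int k = 1; k < n; k++)
        beta[k] = gfmul32(beta[k - 1], beta[0]);     /* vgfmul would produce 4 at once */

    /* Dot product: the j-th chunk counted from the end of the message is
       paired with beta[j] and accumulated with XOR. */
    uint32_t crc = 0;
    for (int j = 0; j < n; j++)
        crc ^= gfmul32(chunk[n - 1 - j], beta[j]);   /* one vgfmac lane */

    printf("CRC residue: %08x\n", crc);
    return 0;
}

Because the β factors depend only on the generator polynomial, the first loop can be run once and its results reused, which is the case reported as "not including β" in Section 4.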

Figure 9: Galois Field Dot Product Used in Vector Computation (each data chunk D_j is Galois-field multiplied by β_j and the products are added with XOR to form the final CRC product)

Figure 10: Galois Field Multiply-Accumulate Unit (an M-bit preliminary multiply produces a (2M-1)-bit product, which passes through Galois field polynomial reduction and then Galois field addition with either 0 or the accumulator, selected by the MAC signal)

4. EXPERIMENTAL RESULTS
Each instruction was simulated using C-code and added to the Sandbridge compiler and simulator as an intrinsic instruction that takes a single thread cycle to execute. The Sandbridge compiler transforms the intrinsics into an intermediate representation that is optimized and scheduled along with the rest of the code, which lets the new operations be included in compound instructions and undergo the same optimizations as other operations. Compiler optimizations include vectorization, loop unrolling, software pipelining, code motion, function inlining, and peephole optimizations [13]. We implemented our baseline CRC algorithm and all optimized CRC algorithms using our intrinsics in C-code and simulated them using a data set of 500 values. Our 8-bit table lookup implementation was approximately 7 times faster than our bit-by-bit CRC implementation. We chose the 8-bit table lookup implementation as our baseline for all speedup calculations, since it is a standard software algorithm. Hardware was designed in Verilog and implemented using the gflxp 0.11 micron CMOS standard cell library and Synopsys Design Compiler.

Function    8 bits/cycle    16 bits/cycle    32 bits/cycle
CRC-8       2.9x            4.6x             23.0x
CRC-16      2.9x            4.6x             23.0x
CRC-32      2.9x            4.6x             23.0x
Table 1: Speed-up Over Base ISA Using Integer Unit

As shown in Table 1, the speedups achieved in the integer unit are directly proportional to the number of bits of data processed each cycle. Although the size of the arithmetic operations changes with the CRC polynomial length, the speedup is constant when compared to a table lookup algorithm, since in all cases the tables fit in memory and our accelerator's performance constraint is the amount of data loaded, not the polynomial length.

Function        8 bits/cycle    16 bits/cycle    32 bits/cycle
Size (µm²)      1,836           8,462            30,078
Latency (ns)
Table 2: Integer Unit Hardware Accelerator Characteristics

Hardware area and worst case delay are shown in Table 2. The hardware area and worst case delay both increase as the number of bits processed per instruction increases, so the tradeoff between hardware cost and potential acceleration has to be taken into consideration in choosing the proper hardware accelerator size. Table 3 shows speedups using the Galois field CRC algorithm both when β factor generation is included and when it is not included in the computation time.

Function                Including β    Not Including β
CRC-8 on 8-bit MAC      3.8x           7.7x
CRC-16 on 16-bit MAC    7.7x           15.3x
CRC-32 on 32-bit MAC    4.8x           7.7x
Table 3: Speed-up Over Base ISA Using Vector Unit

The CRC-8 algorithm is implemented with 8-bit operations, so it processes 8 bits of data each time the vgfmac instruction is called. The CRC-16 and CRC-32 algorithms are scaled similarly. The CRC-32 implementation processes more bits per vgfmac operation than the other two implementations, but it is slowed by the increased number of loads that it must perform. Since most computation has been moved to an intrinsic operation, the majority of operations in the assembly code are loads and stores. As discussed in Section 2, the vector unit requires two load instructions to load a vector with 32-bit elements. So, the CRC-16 and CRC-32 implementations have the same total number of loads and stores, even though the CRC-32 implementation calls the vgfmac instruction half as often. Additionally, the loads and stores in the CRC-32 implementation happen in more concentrated bursts, and the compiler creates less optimized compound instructions than in the CRC-16 implementation.

Function          CRC-8    CRC-16    CRC-32
Area (µm²/VPE)    3,963    18,881    86,111
Latency (ns)
Table 4: Vector Unit Hardware Accelerator Characteristics

The hardware costs for the vector unit operations are shown in Table 4. The area increases quadratically as the operand size increases.

5. SUMMARY
This paper gave an overview of the CRC algorithm, discussed the Sandblaster architecture, explained our implementations, and gave the final speedup numbers. We have shown that instruction set extensions can be used effectively to improve the performance of CRC calculations. The hardware added to the integer unit produced a speedup of up to 23.0 times the baseline implementation. In the vector unit, we achieved a speedup of up to 15.3 times. We are limited by the increased software overhead of the Galois field algorithm and the extra loads for the β factors. Although we did not examine memory use, it is significant to note that the β factors from the Galois field implementation and the look-up table from the baseline implementation must each be stored in memory, while our integer unit implementation does not have any additional memory overhead.

6. ADDITIONAL AUTHORS
John Glossner, Daniel Iancu, Mayan Moudgill, and Sanjay Jinturkar (Sandbridge Technologies, White Plains, NY)

REFERENCES
[1] A. Perez. Byte-wise CRC Calculations. IEEE Micro, pages 40-50, June.
[2] G. Albertengo and R. Sisto. Parallel CRC Generation. IEEE Micro, 10(5):63-71, October 1990.
[3] G. Campobello, G. Patane, and M. Russo. Parallel CRC Realization. IEEE Transactions on Computers, 52(10), October.
[4] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the International Symposium on Computer Architecture, June.
[5] A. Doering and M. Waldvogel. Fast and Flexible CRC Calculation. Electronics Letters, 40(1):10-11, January.
[6] J. Glossner, S. Dorward, S. Jinturkar, M. Moudgill, E. Hokenek, M. Schulte, and S. Vassiliadis. Sandblaster Software Tools. In Proceedings of the Workshop on Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, July.
[7] H. M. Ji and E. Killian. Fast Parallel CRC Algorithm and Implementation on a Configurable Processor. In Proceedings of the IEEE International Conference on Communications, volume 3, 28 April-2 May 2002.
[8] R. Lee. Cyclic Code Redundancy Designers Guide Protects Data. Digital Design, 11(7):77-85, July.
[9] M. Schulte, J. Glossner, S. Mamidi, M. Moudgill, and S. Vassiliadis. A Low-Power Multithreaded Processor for Baseband Communication Systems. Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, Lecture Notes in Computer Science, 3133, July.
[10] A. K. Pandeya and T. J. Cassa. Parallel CRC Lets Many Lines Use One Circuit. Computer Design, 14(9):87-91.
[11] T.-B. Pei and C. Zukowski. High-Speed Parallel CRC Circuits in VLSI. IEEE Transactions on Communications, 40(4).
[12] T. V. Ramabadran and S. S. Gaitonde. A Tutorial on CRC Computations. IEEE Micro, 8(4):62-75.
[13] S. Jinturkar, J. Glossner, V. Kotlyar, and M. Moudgill. The Sandblaster Automatic Multithreaded Vectorizing Compiler. In Proceedings of the 2004 Global Signal Processing Expo and International Signal Processing Conference, September 2004.
[14] S. Mamidi, M. Schulte, D. Iancu, A. Iancu, and J. Glossner. Instruction Set Extensions for Reed-Solomon Encoding and Decoding. In Proceedings of the IEEE 16th International Conference on Application-Specific Systems, Architectures and Processors, Samos, Greece.
[15] M.-D. Shieh, M.-H. Sheu, C.-H. Chen, and H.-F. Lo. A Systematic Approach for Parallel CRC Computations. Journal of Information Science and Engineering, 17(3):445-461, May.
