Instruction Set Extensions for Cyclic Redundancy Check on a Multithreaded Processor
Emily R. Blem, Dept. of ECE, University of Wisconsin-Madison, Madison, WI (blem@cae.wisc.edu)
Suman Mamidi, Dept. of ECE, University of Wisconsin-Madison, Madison, WI (mamidi@cae.wisc.edu)
Michael J. Schulte, Dept. of ECE, University of Wisconsin-Madison, Madison, WI (schulte@engr.wisc.edu)

ABSTRACT
Cyclic redundancy check (CRC) algorithms are widely used for error detection in wireless communication systems. CRC is a simple algorithm, but implementations on conventional processors are inefficient because the algorithm is serial and based on bit-wise operations. In this paper, we explore several instruction set extensions to the Sandbridge multithreaded processor for CRC. The performance speedup of each extension is evaluated using the Sandbridge software tools, and the area and delay of the corresponding hardware are presented. The instruction set extensions produce performance gains of up to 23.0x for the CRC kernel.

1. INTRODUCTION
Cyclic redundancy check (CRC) computations are used in a variety of applications, especially when information transmission or reception is involved. Wireless communication standards that use CRC include Bluetooth, WiMAX, WCDMA, and WLAN. The basic CRC computation divides the incoming bit-stream by an irreducible polynomial and compares the residue before and after transmission. CRC-8 is a CRC standard that uses an 8-bit polynomial; similarly, CRC-16 and CRC-CCITT use 16-bit polynomials, and CRC-32 uses a 32-bit polynomial. Although CRC-32 is the most common CRC computation, CRC-8 and CRC-16 are also useful. This paper examines implementations for all three CRC lengths. The CRC algorithm involves shifts and bit-wise XOR computations, as shown in Figure 1, where M is the length of the CRC polynomial.

[Figure 1: CRC Algorithm. Flowchart: Residue[M-1:0] = initial value; i = 0; Residue = {data[i], Residue[M-1:1]}; if Residue[M-1] = 1, Residue = Residue XOR Polynomial; i++; repeat while i < total data length.]
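A bit-serial software version of this loop, in one common MSB-first formulation (the function name and the CRC-CCITT parameters used below are our illustrative choices, not taken from the paper), might look like:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC, MSB-first: shift one bit position at a time and XOR
   the generator polynomial in whenever the residue's top bit is set,
   mirroring the Figure 1 loop. `poly` omits the implicit x^M term. */
uint16_t crc16_bitwise(const uint8_t *msg, size_t len,
                       uint16_t poly, uint16_t init)
{
    uint16_t residue = init;
    for (size_t i = 0; i < len; i++) {
        residue ^= (uint16_t)((uint16_t)msg[i] << 8);  /* fold next byte in */
        for (int b = 0; b < 8; b++) {                  /* one step per bit  */
            if (residue & 0x8000u)
                residue = (uint16_t)((residue << 1) ^ poly);
            else
                residue = (uint16_t)(residue << 1);
        }
    }
    return residue;
}
```

With the CRC-CCITT polynomial 0x1021 and initial value 0xFFFF, the residue of the ASCII string "123456789" is the well-known check value 0x29B1.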
The initial value is set to either 0 or an M-bit string of ones. The original CRC hardware implementation shifts one bit of data at a time into a linear feedback shift register (LFSR), so a CRC check on a message that is N bits long requires executing a set of operations N times [8]. Implementing an LFSR in software is extremely inefficient, as it requires a series of operations on individual bits [12]. To improve performance, CRC calculations are often performed by shifting in 8 bits at a time and using a look-up table with all 256 possible CRC products [1]. Although this significantly speeds up the computation, the algorithm is still serial and thus does not utilize any available parallel hardware. There is significant literature on various parallel CRC algorithms [5, 10, 11]. In [2], z-transforms from digital filter theory are used to parallelize the CRC computation. Galois field implementations are used in [15] with lookahead techniques for parallel CRC computation, and they are used again in [7, 3] without lookahead techniques. Although algorithms have been developed to better perform CRC calculations in software, conventional instruction set architectures (ISAs) are still unsuited to operations on individual bits of data. It is therefore desirable to use instruction set extensions to further improve the performance of CRC computations. In this paper, we develop ISA extensions and corresponding hardware designs to implement two different CRC algorithms on the Sandblaster multithreaded processor, and we examine the corresponding hardware area and worst-case delay, as well as the overall speedup. The Sandblaster processor is designed to support efficient execution of wireless communication and multimedia applications. In high-bandwidth mobile communication systems, standards like WCDMA, WLAN, and WiMAX must execute quickly and efficiently.
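The byte-wise table method of [1] can be sketched as follows; we use the reflected form of the CRC-32 polynomial 0x04C11DB7 that most software implementations adopt, and the function names are our own:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Precompute the 256-entry table: the CRC contribution of every
   possible byte, using the reflected CRC-32 polynomial
   (0x04C11DB7 bit-reversed to 0xEDB88320). */
static void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_table[n] = c;
    }
}

/* Byte-wise CRC-32: one table look-up per 8 bits of input. */
uint32_t crc32_bytewise(const uint8_t *msg, size_t len)
{
    uint32_t residue = 0xFFFFFFFFu;            /* initial value: all ones */
    for (size_t i = 0; i < len; i++)
        residue = crc_table[(residue ^ msg[i]) & 0xFFu] ^ (residue >> 8);
    return residue ^ 0xFFFFFFFFu;              /* standard final inversion */
}
```

For the string "123456789", crc32_bytewise returns the standard CRC-32 check value 0xCBF43926.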
Since the CRC-32 standard, in particular, is a part of these increasingly important communication standards, developing instruction set extensions to improve the performance of CRC is an important task. The paper is organized as follows: Section 2 provides an overview of the Sandblaster architecture for which the ISA extensions are developed. Section 3 discusses the CRC algorithms, instruction set extensions, and hardware designs. Section 4 gives experimental results, including overall CRC computation speedups and hardware area and worst-case delay. Section 5 provides a summary of the work and suggests a preferred solution for CRC computation using instruction set extensions for the Sandblaster architecture.

2. SANDBLASTER PROCESSOR
The Sandblaster processor is designed for embedded mobile communication and multimedia systems, with features including compound instructions, SIMD vector operations, and hardware support for multiple threads. It uses token-triggered threading and has three units that operate in parallel: an instruction fetch and branch unit, an integer and load/store unit, and a SIMD vector unit. Our instruction set extensions are implemented in both the integer unit and the vector unit. The integer unit takes up to two 32-bit operands and outputs one 32-bit operand. The vector unit takes up to three operands, where each operand corresponds to a 4-element vector, and outputs one 4-element vector. Figure 2 shows a block diagram of these units and the Sandblaster memory subsystem [6, 13, 9].

[Figure 2: Sandblaster Processor. Block diagram: instruction fetch and branch unit; integer and load/store unit; SIMD vector unit; I-Cache (64 KB, 64 B lines, 4-way, 2 active); I-Decode; bus/memory interface; data memory (64 KB, 8 banks).]

The three execution units can be utilized in parallel with the Sandblaster 64-bit compound instruction format. Figure 3 shows a single compound instruction. The operation lvu loads vector register vr0 with four 16-bit elements and updates r3, the address pointer. Concurrently, vmulreds squares the contents of vr0, performs saturating addition with the current accumulator ac0, and puts the result back in ac0.

[Figure 3: A 64-bit Compound Instruction. L0: lvu %vr0, %r3, 8 | vmulreds %ac0, %vr0, %vr0, %ac0 | loop %lc0, L0]
In the branch unit, the loop instruction decrements a counter and branches back to L0 if the loop count is not zero. The Sandblaster processor uses a unique form of interleaved multithreading, called Token Triggered Threading (T3), which is illustrated in Figure 4. With T3, all threads can be simultaneously executing instructions, but only one thread may issue an instruction on a cycle boundary [9]. This constraint is also imposed on round-robin threading. What distinguishes T3 is that on each clock cycle a token indicates the subsequent thread that is to issue an instruction. Thread ordering may be sequential (e.g., round robin), even/odd, or based on other communication patterns. Compared to Simultaneous Multithreading (SMT) [4], T3 has much less hardware complexity and power dissipation, since the method for selecting threads is simplified, only a single compound instruction issues each clock cycle, and dependency-checking hardware is eliminated. The current implementation of the Sandblaster processor supports up to eight simultaneous threads of execution per processor core.

[Figure 4: Token Triggered Threading. Example issue order across the eight threads: T0, T7, T2, T5, T4, T3, T6, T1.]

The SIMD vector processing unit (VPU) has four vector processing elements (VPEs). They execute arithmetic and logic operations on 16-bit, 32-bit, or 40-bit fixed-point vector elements in SIMD fashion. The VPU architecture also contains an accumulator register file, a reduction unit, and a shuffle unit, as shown in Figure 5. High-speed 64-bit data busses allow four 16-bit loads or stores each cycle. For 32-bit operands, load-vector-upper (lvu) and load-vector-lower (lvl) instructions are used to load the data into the VPU in two consecutive thread cycles. Most of the Sandblaster operations have eight pipeline stages, but this latency is hidden by the eight cycles between consecutive instructions in a single thread.
This eight-cycle latency provides up to four execution stages to perform our instruction set extension calculations, so our extensions can have fairly high latencies and complexities. We present two sets of instruction set extensions; one set is implemented as operations in the integer unit, while the other is implemented in the vector unit. Each of the operations can be included in a compound instruction. For example, our vector CRC operation can replace the vmulreds operation in the compound instruction just described.

3. INSTRUCTION SET EXTENSIONS AND HARDWARE DESIGNS
The instruction set extensions are designed to fit within the constraints of the Sandblaster processor architecture. However, the operation designed for the integer unit could be implemented on most processors, and the operation designed for the vector unit could be used in most processors with a SIMD-type architecture. To examine the performance benefits of our ISA extensions, the basic CRC algorithm and a Galois field CRC algorithm [7] are written in C code. We profile these algorithms to find the compute-intensive portions of the code. Those portions are then replaced with new operations, and we design hardware to perform that portion of the code. The compute-intensive portions of code are added to the Sandbridge compiler and simulator as intrinsics. The compiler then treats the new intrinsics as any other operation when scheduling and optimizing the code. The Sandbridge simulator is used to generate cycle counts for the code before and after adding the operations. Section 3.1 details this process for the integer unit, and Section 3.2 discusses the vector unit process.

3.1 Integer Unit
[Figure 6: CRC Polynomial Reduction Algorithm and Intrinsic Optimization. Left (original table-lookup loop): Residue = 0; read 8 bits into data; index = (Residue XOR data) & 0xff; tresidue = lookup(index); Residue = (tresidue XOR Residue) >> 8; repeat while more data. Right (with intrinsic): Residue = 0; read 8 bits into data; reduction(residue, Residue, data); repeat while more data.]

Figure 6 shows the optimization of the CRC algorithm using the integer unit. On the left, we show the original algorithm as implemented in C code. This is the basic algorithm using an 8-bit table lookup discussed in Section 1 [1]. The shaded boxes are the compute-intensive code segments that we replace with an intrinsic. The new algorithm, including the reduction intrinsic, is shown on the right. The intrinsic format is reduction(outputresidue, inputresidue, inputdata). We implement this strategy for CRC-8, CRC-16, and CRC-32 for data chunks of 8, 16, and 32 bits. The hardware corresponding to a 4-bit CRC unit that processes two bits of data (d1 and d0 in Figure 7) is shown in Figure 7. The old residue (denoted by r) and the incoming data (denoted by d) are inputs to the hardware, and the new residue is the output. The polynomial is programmable and is stored in a special-purpose register.

[Figure 5: SIMD Vector Processing Unit. Four VPEs (VPE0-VPE3) with load/store data paths, a shuffle unit, a reduction unit, and an accumulator register file.]

3.2 Vector Unit
The vector unit performs a single operation in parallel on four sets of input operands, so we implement a parallel algorithm that uses Galois field operations and is presented in [7]. Galois field arithmetic can be performed efficiently using specialized hardware and can be used to implement other algorithms, such as Reed-Solomon coding [14]. The algorithm uses operations over a Galois field of size 2^M to parallelize the CRC computation. See [7] for the complete derivation. This field is denoted GF(2^M), where M is the number of bits in each operand.
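Galois field multiplication as used here, a carry-less multiply to a (2M-1)-bit product followed by reduction modulo the generator polynomial, can be sketched for M = 8 (the function names are ours; the test uses the AES field polynomial 0x11B purely as a well-known example, not a polynomial from the paper):

```c
#include <stdint.h>

/* Carry-less multiply in GF(2^8): partial products are formed with
   AND and summed with XOR, giving a 15-bit ((2M-1)-bit) product. */
uint16_t gf_clmul8(uint8_t a, uint8_t b)
{
    uint16_t p = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            p ^= (uint16_t)(a << i);
    return p;
}

/* Reduce a 15-bit product modulo `poly` (which includes the x^8
   term, e.g. 0x11B) down to an 8-bit field element. */
uint8_t gf_reduce8(uint16_t p, uint16_t poly)
{
    for (int i = 14; i >= 8; i--)         /* cancel high bits one by one */
        if ((p >> i) & 1u)
            p ^= (uint16_t)(poly << (i - 8));
    return (uint8_t)p;
}

/* Full GF(2^8) multiply: the two stages above composed. */
uint8_t gf_mul8(uint8_t a, uint8_t b, uint16_t poly)
{
    return gf_reduce8(gf_clmul8(a, b), poly);
}
```

In the AES field, 0x53 and 0xCA are multiplicative inverses, so gf_mul8(0x53, 0xCA, 0x11B) returns 0x01.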
There are three stages in the implementation: loading the data and CRC polynomial, pre-computing the β factors, and
[Figure 7: CRC Polynomial Reduction Unit. Inputs: old residue r3 r2 r1 r0, data bits d1 d0, and programmable polynomial p3 p2 p1 p0; an AND/XOR network produces the new residue r3 r2 r1 r0.]

then performing Galois field multiplication and addition on the data and β factors. These stages are shown on the left side of Figure 8. In GF(2^M), addition is equivalent to the bit-wise XOR of two numbers. Multiplication is a series of XORs and shifts: multiplying two M-bit numbers produces a (2M-1)-bit result, which is then divided by an irreducible polynomial to produce an M-bit reduced product. In our case, the CRC generator polynomial is also the Galois reduction polynomial. Loading the data is a simple process; the N-bit message is split into M-bit pieces and stored in program memory. The CRC polynomial is stored so that it can be used as the Galois reduction polynomial. The β factors depend only on the CRC polynomial and its degree. They can be computed once and, if the CRC generator polynomial remains constant, reused for many different sets of data. So, since most systems use CRC-32 with the polynomial 0x04C11DB7, it is possible to generate the β factors once for many CRC calculations. For this reason, we show performance numbers both with and without β generation in Section 4. The β factors are generated by repeatedly multiplying the CRC generator polynomial by itself using Galois field multiplication. After the β factors are generated and the message loaded, all that remains is to properly multiply the β factors by the data chunks and accumulate the result, which is essentially a dot product and is shown in Figure 9. All of these multiplies and additions can occur in parallel, or they can occur in series with an accumulator keeping a running total of the additions. Galois field multiplication is not a trivial operation in software, so we implement it in hardware using our instruction set extensions. We introduce two instructions: vgfmul (vector multiply) and vgfmac (vector multiply-accumulate).
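The multiply-accumulate step and the resulting dot product can be modeled per lane in C over GF(2^8); the function names are ours, gf_mul8 repeats the multiply-and-reduce sketch so this fragment stands alone, and the test again borrows the AES polynomial 0x11B as a familiar example:

```c
#include <stdint.h>

/* GF(2^8) multiply: carry-less multiply, then reduce by `poly`
   (which includes the x^8 term), as described in Section 3.2. */
static uint8_t gf_mul8(uint8_t a, uint8_t b, uint16_t poly)
{
    uint16_t p = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            p ^= (uint16_t)(a << i);      /* partial products via XOR */
    for (int i = 14; i >= 8; i--)
        if ((p >> i) & 1u)
            p ^= (uint16_t)(poly << (i - 8));  /* polynomial reduction */
    return (uint8_t)p;
}

/* One vgfmac-style step: fold d*beta into acc; GF addition is XOR. */
static uint8_t gfmac8(uint8_t acc, uint8_t d, uint8_t beta, uint16_t poly)
{
    return (uint8_t)(acc ^ gf_mul8(d, beta, poly));
}

/* Figure 9 dot product: the sum of D_j * beta_j over GF(2^8). */
uint8_t gf_dot8(const uint8_t *d, const uint8_t *beta, int n, uint16_t poly)
{
    uint8_t acc = 0;
    for (int j = 0; j < n; j++)
        acc = gfmac8(acc, d[j], beta[j], poly);
    return acc;
}
```

In the hardware, the four VPE lanes each perform one such multiply-accumulate step per vgfmac issue, so the loop above advances four elements per instruction.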
The intrinsic operand formats are vgfmul(result, multiplicand, multiplier) and vgfmac(result, multiplicand, multiplier, accumulator). Figure 8 shows the implementation with and without the intrinsics. On the left side, + and * correspond to Galois field addition and multiplication. The intrinsics operate on vectors with four elements each; the indices in the algorithm on the right refer to the first of those four elements, and the next three elements are automatically referenced as well. The β generation process is optimized by generating the first four β values serially, and then repeatedly calling vgfmul to multiply these values by β3 = β0^4.

[Figure 8: Galois Field CRC Algorithm and Intrinsic Optimization. Both sides: set the CRC polynomial and split the data into n chunks of M bits each, called Dj; β0 = Polynomial; β1 = β0*β0, β2 = β1*β0, β3 = β2*β0. Left (serial): βi = β(i-1)*β0, i++, while i < n; then the dot product Xj = Dj*βj, Aj = A(j-1) + Xj, j++, while j < n. Right (with intrinsics): vgfmul(βi, β(i-4), β3), i = i+4, while i < n; then vgfmac(Aj, Dj, βj, A(j-1)), j = j+4, while j < n.]

The dot product shown in Figure 9 is implemented using vgfmac. There are a total of n = N/M multiplies which must occur, and after each multiply there is an addition, so we use the vgfmac instruction here. The accumulation function is not a critical component, as accumulation is a simple XOR operation that is easily performed in software, but performance is improved by implementing it in hardware with the multiplier. As mentioned in Section 2, each Sandbridge VPE can process 16-, 32-, and 40-bit data types. However, the default load operation for each VPE is 16 bits. Therefore, when we use 32-bit data, we use two load operations, load vector upper (lvu) and load vector lower (lvl). The compiler is modified to automatically include these special load instructions for instruction set extensions to the vector unit with 32-bit data types.

[Figure 9: Galois Field Dot Product Used in Vector Computation. The products D(n-1)*β(n-1), ..., D1*β1, D0*β0 are formed in parallel (MULT) and combined by Galois field addition (ADD) into the final CRC product.]

The vgfmul and vgfmac hardware is shown in Figure 10 for M-bit operands. The preliminary multiplier is implemented using a parallel array of M^2 AND gates, which generates the partial products of D and β, followed by a tree of M^2 XOR gates, which sums the partial products using Galois field addition. The polynomial Galois field reduction unit is similar to that in Figure 7 and uses M(M-1) AND gates and M(M-1) XOR gates. The MAC signal selects between 0 and the accumulator at the end of the operation, allowing the same hardware to be reused for both the vgfmul and vgfmac operations.

[Figure 10: Galois Field Multiply-Accumulate Unit. M-bit operands Di and βi feed an M-bit preliminary multiply producing a (2M-1)-bit product; an M-bit Galois field polynomial reduction (driven by the polynomial register) and M-bit Galois field addition follow; a 2-to-1 mux controlled by the MAC signal selects between 0 and the accumulator Acc.]

4. EXPERIMENTAL RESULTS
Each instruction was simulated using C code and added to the Sandbridge compiler and simulator as an intrinsic instruction that takes a single thread cycle to execute. The Sandbridge compiler transforms the intrinsics into an intermediate representation that is optimized and scheduled along with the rest of the code, which lets the new operations be included in compound instructions and undergo the same optimizations as other operations. Compiler optimizations include vectorization, loop unrolling, software pipelining, code motion, function inlining, and peephole optimizations [13]. We implemented our baseline CRC algorithm and all optimized CRC algorithms using our intrinsics in C code and simulated them using a data set with 500, bit values.

Table 1: Speed-up Over Base ISA Using Integer Unit
  Function   8 bits/cycle   16 bits/cycle   32 bits/cycle
  CRC-8      2.9x           4.6x            23.0x
  CRC-16     2.9x           4.6x            23.0x
  CRC-32     2.9x           4.6x            23.0x

Table 2: Integer Unit Hardware Accelerator Characteristics
  Function       8 bits/cycle   16 bits/cycle   32 bits/cycle
  Size (µm²)     1,836          8,462           30,078
  Latency (ns)
Our 8-bit table-lookup implementation was approximately 7 times faster than our bit-by-bit CRC implementation. We chose the 8-bit table-lookup implementation as our baseline for all speedup calculations, since it is a standard software algorithm. Hardware was designed in Verilog and implemented using the gflxp 0.11-micron CMOS standard cell library and Synopsys Design Compiler. As shown in Table 1, the speedups achieved in the integer unit are directly proportional to the number of bits of data processed each cycle. Although the size of the arithmetic operations changes with the CRC polynomial length, the speedup is constant when compared to a table-lookup algorithm, since in all cases the table fits in memory and our accelerator's performance constraint is the amount of data loaded, not the polynomial length. Hardware area and worst-case delay are shown in Table 2. Both increase as the number of bits processed per instruction increases, so the tradeoff between hardware cost and potential acceleration has to be taken into consideration when choosing the proper hardware accelerator size. Table 3 shows speedups using the Galois field CRC algorithm both when β factor generation is included and when
it is not included in the computation time.

Table 3: Speed-up Over Base ISA Using Vector Unit
  Function               Including β Generation   Not Including β Generation
  CRC-8 on 8-bit MAC     3.8x                     7.7x
  CRC-16 on 16-bit MAC   7.7x                     15.3x
  CRC-32 on 32-bit MAC   4.8x                     7.7x

Table 4: Vector Unit Hardware Accelerator Characteristics
  Function         CRC-8    CRC-16   CRC-32
  Area (µm²/VPE)   3,963    18,881   86,111
  Latency (ns)

The CRC-8 algorithm is implemented with 8-bit operations, so it processes 8 bits of data each time the vgfmac instruction is called. The CRC-16 and CRC-32 algorithms are scaled similarly. The CRC-32 implementation processes more bits per vgfmac operation than the other two implementations, but it is slowed by the increased number of loads that it must perform. Since most computation has been moved to an intrinsic operation, the majority of operations in the assembly code are loads and stores. As discussed in Section 2, the vector unit requires two load instructions to load a vector with 32-bit elements. So, the CRC-16 and CRC-32 implementations have the same total number of loads and stores, even though the CRC-32 implementation calls the vgfmac instruction half as often. Additionally, the loads and stores in the CRC-32 implementation happen in more concentrated bursts, and the compiler creates less-optimized compound instructions than in the CRC-16 implementation. The hardware costs for the vector unit operations are shown in Table 4; the area increases quadratically as the operand size increases.

5. SUMMARY
This paper gave an overview of the CRC algorithm, discussed the Sandblaster architecture, explained our implementations, and gave the final speedup numbers. We have shown that instruction set extensions can be effectively used to improve the performance of CRC calculations. The hardware added to the integer unit produced a speedup of up to 23.0 times the baseline implementation. In the vector unit, we achieved a speedup of up to 15.3 times.
We are limited by the increased software overhead of the Galois field algorithm and the extra loads of the β factors. Although we did not examine memory use, it is significant to note that the β factors in the Galois field implementation and the look-up table in the baseline implementation must each be stored in memory, while our integer unit implementation does not have any additional memory overhead.

6. ADDITIONAL AUTHORS
John Glossner, Daniel Iancu, Mayan Moudgill, and Sanjay Jinturkar (Sandbridge Technologies, White Plains, NY)

REFERENCES
[1] A. Perez. Byte-wise CRC Calculations. IEEE Micro, pages 40-50, June 1983.
[2] G. Albertengo and R. Sisto. Parallel CRC Generation. IEEE Micro, 10(5):63-71, October 1990.
[3] G. Campobello, G. Patane, and M. Russo. Parallel CRC Realization. IEEE Transactions on Computers, 52(10), October 2003.
[4] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the International Symposium on Computer Architecture, June 1995.
[5] A. Doering and M. Waldvogel. Fast and Flexible CRC Calculation. Electronics Letters, 40(1):10-11, January 2004.
[6] J. Glossner, S. Dorward, S. Jinturkar, M. Moudgill, E. Hokenek, M. Schulte, and S. Vassiliadis. Sandblaster Software Tools. In Proceedings of the Workshop on Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, July 2002.
[7] H. M. Ji and E. Killian. Fast Parallel CRC Algorithm and Implementation on a Configurable Processor. In Proceedings of the IEEE International Conference on Communications, volume 3, 28 April-2 May 2002.
[8] R. Lee. Cyclic Code Redundancy Designers Guide Protects Data. Digital Design, 11(7):77-85, July 1981.
[9] M. Schulte, J. Glossner, S. Mamidi, M. Moudgill, and S. Vassiliadis. A Low-Power Multithreaded Processor for Baseband Communication Systems. Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, Lecture Notes in Computer Science, volume 3133, July 2004.
[10] A. K. Pandeya and T. J. Cassa. Parallel CRC Lets Many Lines Use One Circuit. Computer Design, 14(9):87-91, September 1975.
[11] T.-B. Pei and C. Zukowski. High-Speed Parallel CRC Circuits in VLSI. IEEE Transactions on Communications, 40(4), April 1992.
[12] T. V. Ramabadran and S. S. Gaitonde. A Tutorial on CRC Computations. IEEE Micro, 8(4):62-75, 1988.
[13] S. Jinturkar, J. Glossner, V. Kotlyar, and M. Moudgill. The Sandblaster Automatic Multithreaded Vectorizing Compiler. In Proceedings of the 2004 Global Signal Processing Expo and International Signal Processing Conference, September 2004.
[14] S. Mamidi, M. Schulte, D. Iancu, A. Iancu, and J. Glossner. Instruction Set Extensions for Reed-Solomon Encoding and Decoding. In Proceedings of the IEEE 16th International Conference on Application-Specific Systems, Architectures and Processors, Samos, Greece, 2005.
[15] M.-D. Shieh, M.-H. Sheu, C.-H. Chen, and H.-F. Lo. A Systematic Approach for Parallel CRC Computations. Journal of Information Science and Engineering, 17(3):445-461, May 2001.
More informationData Link Networks. Hardware Building Blocks. Nodes & Links. CS565 Data Link Networks 1
Data Link Networks Hardware Building Blocks Nodes & Links CS565 Data Link Networks 1 PROBLEM: Physically connecting Hosts 5 Issues 4 Technologies Encoding - encoding for physical medium Framing - delineation
More informationHigh Speed Special Function Unit for Graphics Processing Unit
High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum
More informationDesign of Flash Controller for Single Level Cell NAND Flash Memory
Design of Flash Controller for Single Level Cell NAND Flash Memory Ashwin Bijoor 1, Sudharshana 2 P.G Student, Department of Electronics and Communication, NMAMIT, Nitte, Karnataka, India 1 Assistant Professor,
More informationVLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT K.Sandyarani 1 and P. Nirmal Kumar 2 1 Research Scholar, Department of ECE, Sathyabama
More informationSome portions courtesy Robin Kravets and Steve Lumetta
CSE 123 Computer Networks Fall 2009 Lecture 4: Data-Link I: Framing and Errors Some portions courtesy Robin Kravets and Steve Lumetta Administrative updates I m Im out all next week no lectures, but You
More informationMicroprocessor Extensions for Wireless Communications
Microprocessor Extensions for Wireless Communications Sridhar Rajagopal and Joseph R. Cavallaro DRAFT REPORT Rice University Center for Multimedia Communication Department of Electrical and Computer Engineering
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationHigh-Performance Cryptography in Software
High-Performance Cryptography in Software Peter Schwabe Research Center for Information Technology Innovation Academia Sinica September 3, 2012 ECRYPT Summer School: Challenges in Security Engineering
More information2.4 Error Detection Bit errors in a frame will occur. How do we detect (and then. (or both) frames contains an error. This is inefficient (and not
CS475 Networks Lecture 5 Chapter 2: Direct Link Networks Assignments Reading for Lecture 6: Sections 2.6 2.8 Homework 2: 2.1, 2.4, 2.6, 2.14, 2.18, 2.31, 2.35. Due Thursday, Sept. 15 2.4 Error Detection
More informationCS 101, Mock Computer Architecture
CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations
More informationA Brief Description of the NMP ISA and Benchmarks
Report No. UIUCDCS-R-2005-2633 UILU-ENG-2005-1823 A Brief Description of the NMP ISA and Benchmarks by Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine February 2005 A Brief Description
More informationCRC Generation for Protocol Processing
CRC Generation for Protocol Processing Ulf Nordqvist, Tomas Henrikson and Dake Liu Department of Physics and Measurement Technology Linköpings University, SE 58183 Linköping, Sweden Phone: +46-1328-{8916,
More informationTowards a Java-enabled 2Mbps wireless handheld device
Towards a Java-enabled 2Mbps wireless handheld device John Glossner 1, Michael Schulte 2, and Stamatis Vassiliadis 3 1 Sandbridge Technologies, White Plains, NY 2 Lehigh University, Bethlehem, PA 3 Delft
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationA Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications
A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications Metin Mete Özbilen 1 and Mustafa Gök 2 1 Mersin University, Engineering Faculty, Department of Computer Science,
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationANALYSIS OF AN AREA EFFICIENT VLSI ARCHITECTURE FOR FLOATING POINT MULTIPLIER AND GALOIS FIELD MULTIPLIER*
IJVD: 3(1), 2012, pp. 21-26 ANALYSIS OF AN AREA EFFICIENT VLSI ARCHITECTURE FOR FLOATING POINT MULTIPLIER AND GALOIS FIELD MULTIPLIER* Anbuselvi M. and Salivahanan S. Department of Electronics and Communication
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationChapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction
More informationsignature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1
CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS
ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS 1.Define Computer Architecture Computer Architecture Is Defined As The Functional Operation Of The Individual H/W Unit In A Computer System And The
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationVIII. DSP Processors. Digital Signal Processing 8 December 24, 2009
Digital Signal Processing 8 December 24, 2009 VIII. DSP Processors 2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator (MAC), Modified bus structures and memory access
More informationEC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution
EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution Important guidelines: Always state your assumptions and clearly explain your answers. Please upload your solution document
More informationPartitioned Branch Condition Resolution Logic
1 Synopsys Inc. Synopsys Module Compiler Group 700 Middlefield Road, Mountain View CA 94043-4033 (650) 584-5689 (650) 584-1227 FAX aamirf@synopsys.com http://aamir.homepage.com Partitioned Branch Condition
More informationAdvanced Caching Techniques (2) Department of Electrical Engineering Stanford University
Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15
More informationWilliam Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ William Stallings Computer Organization and Architecture 10 th Edition 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. 2 + Chapter 3 A Top-Level View of Computer Function and Interconnection
More informationCOSC 243. Computer Architecture 1. COSC 243 (Computer Architecture) Lecture 6 - Computer Architecture 1 1
COSC 243 Computer Architecture 1 COSC 243 (Computer Architecture) Lecture 6 - Computer Architecture 1 1 Overview Last Lecture Flip flops This Lecture Computers Next Lecture Instruction sets and addressing
More informationEECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC
EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC April 12, 2012 John Wawrzynek Spring 2012 EECS150 - Lec24-hdl3 Page 1 Parallelism Parallelism is the act of doing more than one thing
More informationOne-Level Cache Memory Design for Scalable SMT Architectures
One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract
More informationISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies
VLSI IMPLEMENTATION OF HIGH PERFORMANCE DISTRIBUTED ARITHMETIC (DA) BASED ADAPTIVE FILTER WITH FAST CONVERGENCE FACTOR G. PARTHIBAN 1, P.SATHIYA 2 PG Student, VLSI Design, Department of ECE, Surya Group
More informationReminder: tutorials start next week!
Previous lecture recap! Metrics of computer architecture! Fundamental ways of improving performance: parallelism, locality, focus on the common case! Amdahl s Law: speedup proportional only to the affected
More informationVLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier
VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier U.V.N.S.Suhitha Student Department of ECE, BVC College of Engineering, AP, India. Abstract: The ever growing need for improved
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationVector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data
Vector Processors A vector processor is a pipelined processor with special instructions designed to keep the (floating point) execution unit pipeline(s) full. These special instructions are vector instructions.
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationECE 486/586. Computer Architecture. Lecture # 7
ECE 486/586 Computer Architecture Lecture # 7 Spring 2015 Portland State University Lecture Topics Instruction Set Principles Instruction Encoding Role of Compilers The MIPS Architecture Reference: Appendix
More informationCOMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital
Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital hardware modules that accomplish a specific information-processing task. Digital systems vary in
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationDesign of a Pipelined 32 Bit MIPS Processor with Floating Point Unit
Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit P Ajith Kumar 1, M Vijaya Lakshmi 2 P.G. Student, Department of Electronics and Communication Engineering, St.Martin s Engineering College,
More informationChapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations
Chapter 4 The Processor Part I Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationPipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications
, Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationMPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors
MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationPrinciples in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008
Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.
More informationData Speculation. Architecture. Carnegie Mellon School of Computer Science
Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A
More informationLaboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication
Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationLecture 25: Busses. A Typical Computer Organization
S 09 L25-1 18-447 Lecture 25: Busses James C. Hoe Dept of ECE, CMU April 27, 2009 Announcements: Project 4 due this week (no late check off) HW 4 due today Handouts: Practice Final Solutions A Typical
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationAiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.
2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationCS Computer Architecture
CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 Computer Systems Organization The CPU (Central Processing Unit) is the brain of the computer. Fetches instructions from main memory.
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationECE 341 Final Exam Solution
ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.
More informationHigh Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC
Journal of Computational Information Systems 7: 8 (2011) 2843-2850 Available at http://www.jofcis.com High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Meihua GU 1,2, Ningmei
More informationScreaming Fast Galois Field Arithmetic Using Intel SIMD Instructions
Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions Ethan L. Miller Center for Research in Storage Systems University of California, Santa Cruz (and Pure Storage) Authors Jim Plank Univ.
More informationM A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
M A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE 6.111 Introductory Digital Systems Laboratory Fall 2017 Lecture PSet #6 of
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More information
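The shift-and-XOR loop of Figure 1 can be sketched in software as follows. This is a minimal Python sketch of the common MSB-first bit-serial formulation (Figure 1 itself shows the reflected, LSB-first variant), using the 16-bit CRC-CCITT polynomial 0x1021; the function name and parameter defaults are illustrative, not from the paper:

```python
def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bit-serial CRC-16, MSB-first: one message bit is shifted into the
    residue per iteration, and the residue is XORed with the polynomial
    whenever the bit shifted out of the high end is 1."""
    residue = init
    for byte in data:
        for bit in range(7, -1, -1):          # process each message bit
            msb = (residue >> 15) ^ ((byte >> bit) & 1)
            residue = (residue << 1) & 0xFFFF  # shift residue left by one
            if msb:
                residue ^= poly                # conditional XOR with polynomial
    return residue
```

The inner loop's single-bit shifts and conditional XORs are exactly the operations that make a conventional processor inefficient here, and that the paper's instruction set extensions target.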