A study on SIMD architecture
Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian
Department of Electrical Engineering and Computer Science, University of Central Florida

Abstract. Single instruction, multiple data (SIMD) architectures became popular with the increasing demand of data streaming applications such as real-time games and video processing. Since modern processors in desktop computers support SIMD instructions with various implementations, we may use these machines to optimize applications in which we process multiple data with single instructions. In this project, we study the use of SIMD architectures and their effects on the performance of specific applications. We choose the matrix multiplication and Advanced Encryption Standard (AES) encryption algorithms and modify them to exploit SIMD instructions. The performance improvements from using SIMD instructions are analyzed and validated by an experimental study.

I. INTRODUCTION

Performance in a computer system is defined by the amount of useful work accomplished by the system compared to the time and resources used. There are several aspects to improving the performance of computer systems, and researchers from several areas, ranging from algorithm and compiler designers to OS and hardware designers, are striving to achieve higher performance. Nowadays, most CPU designs contain at least some vector processing instructions, typically referred to as SIMD, which operate on a few vector elements per clock cycle in a pipeline. These vector processors run multiple mathematical operations on multiple data elements simultaneously. Thus, they affect the performance equation. From the Iron Law, we know that the execution time of a program is given by T = IC x CPI x CT, where IC is the number of instructions (instruction count), CPI is the average number of cycles per instruction, and CT is the cycle time of the processor; performance is the inverse of execution time.
The use of a SIMD architecture changes the IC and CPI values of a program. With new enhancements in processor architectures, current modern processors have started supporting 256-bit vector implementations. Moreover, interest in SIMD architectures within the computer architecture research community is increasing. In the near future, new powerful machines are expected to make new high-performance multiple-data applications available with enhanced SIMD architectures.

In this study, we target the SIMD architecture and its effects on the performance of some designed cases. We then analyze the performance of our approach by evaluating the speed-up of programs using the SIMD architecture. One of the challenges we face is configuring compilers to emit vector instructions in the binary executable file. The other challenge is designing case studies for measuring speed-up. For this reason, the matrix multiplication and AES encryption algorithms are chosen as candidates, since both perform many simple linear operations on multiple data. For the implementation part of this research, we use a single machine with minimum load as the test platform for measuring performance.

As a group of graduate students taking the CDA 5106 course, we try to get involved in this challenge by doing research on SIMD CPU architecture. In this project, we worked on almost all phases together as a team, having regular meetings before and after the milestones. To sum up, Gürkan worked on the implementation and documentation, prepared the final report, and researched the history and related studies. Rouhollah worked on the implementation of the optimized algorithms and the documentation phases; he also conducted the experiments. Mohammad proposed the main idea and worked on the proposal, the documentation for the benchmarks, and the analysis of the results. The rest of the paper is organized as follows.
Section II briefly summarizes the history of SIMD architectures and the related work. We describe SIMD and some architectures in Section III. We provide a detailed description of our benchmarks in Section IV. The results of the experiments are presented in Section V. We finally conclude in Section VI.

II. RELATED WORK

Let us briefly discuss the history of SIMD architectures and the related work in the literature. The first use of SIMD instructions was in the early 1970s [1]. For example, they were used in the CDC STAR-100 and TI ASC machines, which could perform the same operation on a large set of data. This architecture became especially common when Cray Inc. used it in their supercomputers. However, the vector processing used in these machines is nowadays considered different from SIMD machines. Thinking Machines' CM-1 and CM-2, which are considered massively parallel processing-style supercomputers [2], started a new era in using SIMD for processing data in parallel. Current research, however, focuses on using SIMD instructions in desktop computers. Many of the tasks desktop computers perform these days, like video processing and real-time gaming, apply the same operation to large amounts of data, so companies tried to bring this architecture to desktops. As one of the earliest attempts, Sun Microsystems introduced SIMD integer instructions with the VIS (visual instruction set) extensions in the UltraSPARC I microprocessor, and MIPS introduced MDMX (MIPS Digital Media eXtension). Intel made SIMD widely used by introducing the MMX extensions to the x86 architecture. Motorola then introduced the AltiVec system in its PowerPCs, which was also used in IBM's POWER systems. Intel's response to this was the introduction of SSE. These days, SSE and its extensions are used more than the others.
Fig. 1. Processor array in a supercomputer. Taken from [5].
Fig. 2. The relationship between the processor array and the host computer. Taken from [5].
Fig. 3. Data processing with SISD vs. SIMD. Taken from [6].

There are various studies in the literature, conducted by research groups or companies, which focus on hardware. Holzer-Graf et al. [3] studied efficient vector implementations of AES-based designs. In that paper, three different vector implementations are analyzed and their performance is compared. The use of chip multiprocessing and the Cell Broadband Engine is described by Gschwind [4].

III. SIMD ARCHITECTURE

To understand the improvements being made on SIMD architectures, let us first start with the older SIMD architectures. Earlier versions of SIMD architectures were proposed for supercomputers [1], which have a number of processing elements and a control unit. In these machines, the processing elements (PEs) are the pieces which perform the computation, while the single control unit controls the array of PEs; it is generally responsible for reading the instructions, decoding them, and sending control signals to the PEs. Data are supplied to the PEs by a memory. In this architecture, the number of data paths from the memory to the PEs is equal to the number of PEs. These supercomputers also have specific interconnection networks, which provide flexible, high-performance data movement to and from the PEs. They also have an I/O system, which differs from one machine to another. Figure 1 illustrates processor array architectures in supercomputers, including the memory modules, interconnection network, PEs, control unit, and I/O system. In some supercomputers, the processing elements are controlled by a host computer, which is illustrated in Figure 2.
Supercomputers categorized as multiple instruction, multiple data (MIMD) in Flynn's taxonomy became popular, and this reduced the interest in SIMD machines for some period of time. Enhancements in desktop computers then led to powerful machines strong enough to handle applications such as video processing. Therefore, SIMD architectures again became popular in the 1990s.

SIMD architectures exploit a property of the data stream called data parallelism. SIMD computing is also known as vector processing, considering the rows of data coming to the processor as vectors of data. It is almost impossible to have applications which are purely parallel, so the pure use of SIMD computing is not possible. Hence, in applications of SIMD computing, the programs are written for single instruction, single data (SISD) machines and include SIMD instructions. The proportion of the sequential part and the SIMD part in the program determines the maximum speed-up, according to Amdahl's law. Figure 3 shows the data exploitation by a SIMD machine which processes 3 vectors at the same time, compared to an SISD machine which is able to process one row of data at a time. The length of the vectors in a SIMD processor determines the number of elements of a given data type. For instance, a 128-bit vector implementation in a processor allows us to do four-way single-precision floating-point operations.

We may categorize SIMD operations by their types. The first obvious operation type is intra-element arithmetic and non-arithmetic operations. Addition, multiplication, and subtraction are example arithmetic operations, while AND,
XOR, and OR are examples of non-arithmetic operations. Figure 4 illustrates the intra-element operations with two source vectors VA and VB and a destination vector VT. Each of these vectors contains four 32-bit elements. This means we can operate on two vectors, each holding four integers or single-precision floating-point values; similarly, we can process two vectors with two double-precision values inside each of them. The other operation type is inter-element arithmetic and non-arithmetic operations. These operations are between the elements of a single vector. Vector permutes and logical shifts are example inter-element operations (Figure 5).

Fig. 4. Intra-element arithmetic and non-arithmetic operations. Taken from [6].
Fig. 5. Inter-element operations between the elements of a vector. Taken from [6].
Fig. 6. AltiVec architecture with 4 distinct registers. Taken from [6].

To understand the modern SIMD architectures better, let us now briefly discuss some of them. We start with the AltiVec architecture, which is used by many companies including Apple and Motorola. The AltiVec architecture is illustrated in Figure 6. As one can see from this figure, there are four distinct sets of registers in AltiVec. Vectors VA and VB are the source vectors, while vector VT is the destination vector. The source registers hold the operands, while the destination registers hold the result. The additional vector VC is called the filter or modifier. This vector is useful for many operations, including vector permutations. In the vector permutation operation, the vector VC holds values that indicate, for each element of the vectors VA and VB, its new location in the destination vector VT. An illustration of this procedure can be seen in Figure 7. Intel introduced the MMX/SSE architecture, which is capable of SIMD operations on integers and floating-point values, with the new MMX series of processors. They used the Katmai New Instructions (KNI) for the new processors.
As shown in Figure 9, they added new 128-bit registers, called SSE registers. They had a floating-point unit alongside the SIMD units in their processors. They also had a switch mechanism, which allows the processor to change its mode from SISD to SIMD. Figure 8 shows the architecture of a 128-bit Intel MMX/SSE implementation. The 128-bit processor is considered a combination of two 64-bit processing units, which enables these processors to do two different operations at the same time. Nowadays, Intel's 256-bit SIMD architecture is available.

Fig. 7. Vector permutation with AltiVec architecture. Taken from [6].

Here, we finish our discussion of SIMD architectures, having given significant architecture examples, and come to the next level in the study. The main idea of this study is to use these architectures with convenient benchmarks to evaluate their performance improvements for various programs. Therefore, the benchmarks used in the study are described in detail in the next section.

IV. MATRIX MULTIPLICATION AND AES

There is no doubt that future processors will differ significantly from current designs and will reshape the way we think about programming. SIMD is one of the most important advancements of modern CPUs. This research deals with SIMD and how it can be used to increase the performance of programs. The main goal of our research is to use the SIMD instruction set to improve program performance. The first problem we encountered is that there are no SIMD-implemented benchmarks for assessing the performance, so we created our own tools to assess the improvement in performance. In the next part
we will briefly describe the types of vector operations that were available for the implementation [7]:

V <- V          Example: complement all elements
S <- V          Examples: min, max, sum
V <- V x V      Examples: vector addition, multiplication, division
V <- V x S      Examples: multiply or add a scalar to a vector
S <- V x V      Example: calculate an element of a matrix

Fig. 8. Intel MMX/SSE architecture. Taken from [6].
Fig. 9. 128-bit Intel MMX/SSE architecture with SSE registers. Taken from [6].
Fig. 10. Ordinary implementation of matrix multiplication.
Fig. 11. Matrix multiplication with just 3 SIMD instructions.

We chose matrix multiplication and AES for the benchmarks, for two reasons. First, they are used in many applications in desktop computers and also in mobile devices these days. Second, they can be optimized with SIMD instructions because they perform the same instructions on large amounts of data. We now briefly describe the benchmarks.

A. Matrix multiplication

Matrix multiplication is one of the best candidates to be optimized by SIMD instructions, because it deals with matrices, which consist of arrays. Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra. It forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. Figure 10 shows a naive implementation of matrix multiplication with many scalar instructions. By applying SIMD instructions, the number of executed instructions decreases, and we achieve a noticeable performance improvement. Figure 11 shows the listing of the code with the SIMD instruction set.

B. AES encryption algorithm

AES is the other candidate for implementation with SIMD instructions. It has vast applications everywhere in computer systems, from mobile devices to distributed data centers.
AES is a block cipher algorithm that was selected by the U.S. National Institute of Standards and Technology (NIST). It was selected by contest from a list of five finalists, which were themselves selected from an original list of more than 15 submissions. AES will begin to supplant the Data Encryption Standard (DES), and the later Triple DES, over the next few years in many cryptography applications. The algorithm was designed by two Belgian cryptologists, Vincent Rijmen and Joan Daemen [8], whose surnames are reflected in the cipher's name. The cipher has a variable block length and key length. The authors specify how to use keys with a length of 128, 192, or 256 bits to encrypt blocks with a length of 128, 192, or 256 bits (all nine combinations of key length and block length are possible); both the block length and the key length can be extended very easily to multiples of 32 bits. Figure 12 illustrates the flow chart of the algorithm that we use. AES can be implemented very efficiently on a wide range of processors, and we have implemented it using SIMD instructions.
Fig. 12. Flow chart for the AES algorithm.
Fig. 13. Performance gains using streaming SIMD extensions. Taken from [9].
Fig. 14. Time to do 512x512 multiplication in milliseconds.
Fig. 15. AES encryption (cycles/byte).

V. EXPERIMENTAL STUDY

A. Implementation

Basically, there are two ways to implement an algorithm using SIMD instructions. First, it is possible to implement the algorithm using normal instructions and use a compiler which automatically converts the instructions to SIMD instructions wherever possible. Second, it is possible to optimize the algorithm by writing the SIMD instructions manually. For the first method, Microsoft Visual Studio is not able to do this task at this time, but Intel has a compiler which is capable of doing it. However, the most optimized version can be achieved by manually implementing an algorithm and choosing the most appropriate instructions for the implementation. So, we chose the second method, which gives the best result.

We implemented the applications in C++ and used Microsoft Visual Studio to compile them. We disabled all the compiler optimization features to make sure that any improvement comes only from using SIMD instructions, not from compiler optimizations like loop unrolling. To obtain an acceptable estimate of the number of cycles the algorithms need, we closed all unnecessary processes and ran each implemented function a large number of times. We then divided the total number of cycles by the number of executions of the function, giving the average number of cycles needed to execute the algorithm.

B. Results

The first benchmark was matrix multiplication. Figure 13 shows the results of its implementation by Intel. Most implementations of matrix multiplication do the task on fixed-dimension matrices. This makes the algorithm faster, because they do not use loops, which have their own overhead and cannot be written using SIMD instructions.
However, our implementation of matrix multiplication can do this task on variable matrix dimensions. Considering this point, our optimization is still comparable to the Intel implementation. Figure 14 compares our SIMD implementation of matrix multiplication with the original naive algorithm; it is about 2 times faster. Figure 15 shows the result of our optimization of AES encryption, which is about 23 times faster.

We now give some explanation of these results. First of all, wherever we replace a normal instruction with a SIMD one, we decrease the number of instructions by a factor of four. However, not all instructions are replaceable by SIMD instructions. In addition, on average, the number of cycles needed to execute a SIMD instruction is higher than for the normal instructions doing the same task. But, considering the Iron Law, the decrease in the number of instructions affects the final performance more than the increase in the average number of cycles per instruction. So, successful optimization of an algorithm depends
on two factors: first, how many instructions are replaceable by SIMD instructions; second, what combination of instructions is used in the SIMD implementation. The second factor is important because different SIMD instructions need different numbers of cycles to execute.

VI. CONCLUSION

In this paper, we studied SIMD architectures. We summarized the history of these architectures and explained some of the new ones. We proposed the use of SIMD for two applications, matrix multiplication and AES. The operations performed for matrix multiplication and the AES algorithm were described in detail. The experimental results showed that the use of SIMD significantly improves the performance of each of the programs.

REFERENCES

[1] Wikipedia, "SIMD," 2013. [Online; accessed 15-April-2013].
[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface.
[3] S. Holzer-Graf et al., "Efficient vector implementations of AES-based designs: a case study and new implementations for Grøstl," Topics in Cryptology, CT-RSA.
[4] M. Gschwind, "Chip multiprocessing and the Cell Broadband Engine," Computing Frontiers.
[5] H. W. Lawson and B. Svensson, Parallel Processing in Industrial Real-Time Applications. Prentice Hall.
[6] J. Stokes, "SIMD architectures," Ars Technica.
[7] Intel Corp., Intel Advanced Vector Extensions Programming Reference.
[8] J. Daemen and V. Rijmen, The Design of Rijndael: AES, the Advanced Encryption Standard. Springer.
[9] Intel Corp., "Streaming SIMD Extensions: Matrix Multiplication," Application Note AP-930, June 1999.
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationChapter 2 Logic Gates and Introduction to Computer Architecture
Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are
More informationAn Introduction to Parallel Architectures
An Introduction to Parallel Architectures Andrea Marongiu a.marongiu@unibo.it Impact of Parallel Architectures From cell phones to supercomputers In regular CPUs as well as GPUs Parallel HW Processing
More informationQuiz for Chapter 1 Computer Abstractions and Technology
Date: Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
More informationEmbedded Systems Architecture. Computer Architectures
Embedded Systems Architecture Computer Architectures M. Eng. Mariusz Rudnicki 1/18 A taxonomy of computer architectures There are many different types of architectures, and it is worth considering some
More informationFundamentals of Computer Design
CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining
More informationLecture Topics. Announcements. Today: The MIPS ISA (P&H ) Next: continued. Milestone #1 (due 1/26) Milestone #2 (due 2/2)
Lecture Topics Today: The MIPS ISA (P&H 2.1-2.14) Next: continued 1 Announcements Milestone #1 (due 1/26) Milestone #2 (due 2/2) Milestone #3 (due 2/9) 2 1 Evolution of Computing Machinery To understand
More informationParallel Processing SIMD, Vector and GPU s
Parallel Processing SIMD, ector and GPU s EECS4201 Comp. Architecture Fall 2017 York University 1 Introduction ector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationEE382 Processor Design. Concurrent Processors
EE382 Processor Design Winter 1998-99 Chapter 7 and Green Book Lectures Concurrent Processors, including SIMD and Vector Processors Slide 1 Concurrent Processors Vector processors SIMD and small clustered
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More information3.3 Hardware Parallel processing
Parallel processing is the simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more CPUs running it. In practice, it is
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationCISC Attributes. E.g. Pentium is considered a modern CISC processor
What is CISC? CISC means Complex Instruction Set Computer chips that are easy to program and which make efficient use of memory. Since the earliest machines were programmed in assembly language and memory
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationReview of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism.
CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 1 Review of Last Lecture Amdahl s Law limits benefits of parallelization Request Level Parallelism
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationCS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions
CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 3/08/2013 Spring 2013 Lecture #19 1 Review of Last Lecture Amdahl s Law limits benefits
More informationStructure of Computer Systems
Structure of Computer Systems Structure of Computer Systems Baruch Zoltan Francisc Technical University of Cluj-Napoca Computer Science Department U. T. PRES Cluj-Napoca, 2002 CONTENTS PREFACE... xiii
More informationDesign of an Efficient Architecture for Advanced Encryption Standard Algorithm Using Systolic Structures
Design of an Efficient Architecture for Advanced Encryption Standard Algorithm Using Systolic Structures 1 Suresh Sharma, 2 T S B Sudarshan 1 Student, Computer Science & Engineering, IIT, Khragpur 2 Assistant
More informationPrinciples of Computer Architecture. Chapter 10: Trends in Computer. Principles of Computer Architecture by M. Murdocca and V.
10-1 Principles of Computer Architecture Miles Murdocca and Vincent Heuring Chapter 10: Trends in Computer Architecture 10-2 Chapter Contents 10.1 Quantitative Analyses of Program Execution 10.2 From CISC
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationIntroduction to Parallel Processing
Babylon University College of Information Technology Software Department Introduction to Parallel Processing By Single processor supercomputers have achieved great speeds and have been pushing hardware
More informationComputer Architecture
Computer Architecture Computer Architecture Hardware INFO 2603 Platform Technologies Week 1: 04-Sept-2018 Computer architecture refers to the overall design of the physical parts of a computer. It examines:
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationParallel Numerics, WT 2013/ Introduction
Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationFrom CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations
1 CISC Creates the Anti CISC Revolution Digital Equipment Company (DEC) introduces VAX (1977) Commercially successful 32-bit CISC minicomputer From CISC to RISC In 1970s and 1980s CISC minicomputers became
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationImplementation of the block cipher Rijndael using Altera FPGA
Regular paper Implementation of the block cipher Rijndael using Altera FPGA Piotr Mroczkowski Abstract A short description of the block cipher Rijndael is presented. Hardware implementation by means of
More informationFLYNN S TAXONOMY OF COMPUTER ARCHITECTURE
FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn s classification scheme is based on the notion of a stream of information.
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More information10th August Part One: Introduction to Parallel Computing
Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More information3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:
BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General
More informationRAID 0 (non-redundant) RAID Types 4/25/2011
Exam 3 Review COMP375 Topics I/O controllers chapter 7 Disk performance section 6.3-6.4 RAID section 6.2 Pipelining section 12.4 Superscalar chapter 14 RISC chapter 13 Parallel Processors chapter 18 Security
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationLecture 8: RISC & Parallel Computers. Parallel computers
Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer
More informationAdvanced Encryption Standard and Modes of Operation. Foundations of Cryptography - AES pp. 1 / 50
Advanced Encryption Standard and Modes of Operation Foundations of Cryptography - AES pp. 1 / 50 AES Advanced Encryption Standard (AES) is a symmetric cryptographic algorithm AES has been originally requested
More informationBlock Ciphers. Lucifer, DES, RC5, AES. CS 470 Introduction to Applied Cryptography. Ali Aydın Selçuk. CS470, A.A.Selçuk Block Ciphers 1
Block Ciphers Lucifer, DES, RC5, AES CS 470 Introduction to Applied Cryptography Ali Aydın Selçuk CS470, A.A.Selçuk Block Ciphers 1 ... Block Ciphers & S-P Networks Block Ciphers: Substitution ciphers
More informationCryptography and Network Security
Cryptography and Network Security Spring 2012 http://users.abo.fi/ipetre/crypto/ Lecture 6: Advanced Encryption Standard (AES) Ion Petre Department of IT, Åbo Akademi University 1 Origin of AES 1999: NIST
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationPerformance, Power, Die Yield. CS301 Prof Szajda
Performance, Power, Die Yield CS301 Prof Szajda Administrative HW #1 assigned w Due Wednesday, 9/3 at 5:00 pm Performance Metrics (How do we compare two machines?) What to Measure? Which airplane has the
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationFigure 1-1. A multilevel machine.
1 INTRODUCTION 1 Level n Level 3 Level 2 Level 1 Virtual machine Mn, with machine language Ln Virtual machine M3, with machine language L3 Virtual machine M2, with machine language L2 Virtual machine M1,
More information