A study on SIMD architecture
Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian
Department of Electrical Engineering and Computer Science, University of Central Florida

Abstract. Single instruction, multiple data (SIMD) architectures became popular with the increasing demand of data streaming applications such as real-time games and video processing. Since modern processors in desktop computers support SIMD instructions with various implementations, we may use these machines to optimize applications in which we process multiple data with single instructions. In this project, we study the use of SIMD architectures and their effects on the performance of specific applications. We choose the matrix multiplication and Advanced Encryption Standard (AES) encryption algorithms and modify them to exploit SIMD instructions. The performance improvements from using SIMD instructions are analyzed and validated by an experimental study.

I. INTRODUCTION

Performance in a computer system is defined by the amount of useful work accomplished by the system compared to the time and resources used. There are several aspects to improving the performance of computer systems, and researchers from several areas, ranging from algorithm and compiler designers to OS and hardware designers, are striving to achieve higher performance. Nowadays, most CPU designs contain at least some vector processing instructions, typically referred to as SIMD, which operate on a few vector elements per clock cycle in a pipeline. These vector processors run multiple mathematical operations on multiple data elements simultaneously. Thus, they affect the performance equation. From the Iron Law, we know that the execution time of a program is given by T = IC x CPI x CT, where IC is the number of instructions (instruction count), CPI is the average number of cycles per instruction, and CT is the cycle time of the processor; performance is the inverse of execution time.
The use of a SIMD architecture changes the IC and CPI values of a program. With new enhancements in processor architectures, current modern processors have started supporting 256-bit vector implementations. Moreover, interest in SIMD architectures within the computer architecture research community is increasing. In the near future, new powerful machines are expected to make new high-performance multiple-data applications available with enhanced SIMD architectures.

In this study, we target the SIMD architecture and its effects on the performance of some designed cases. We then analyze the performance of our approach by evaluating the speed-up of programs using the SIMD architecture. One of the challenges we face is configuring compilers to emit vector instructions in the binary executable file. The other challenge is designing case studies for measuring speed-up. For this reason, the matrix multiplication and AES encryption algorithms are chosen as candidates, since both perform many simple linear operations on multiple data. For the implementation part of this research, we use a single machine with minimum load as the test platform for measuring performance.

As a group of graduate students taking the CDA 5106 course, we try to get involved in this challenge by doing research on SIMD CPU architecture. In this project, we worked on almost all phases together as a team, having regular meetings before and after the milestones. To sum up, Gürkan worked on the implementation and documentation, prepared the final report, and researched the history and related studies. Rouhollah worked on the implementation of the optimized algorithms and the documentation phases; he also conducted the experiments. Mohammad proposed the main idea and worked on the proposal, the documentation for the benchmarks, and the analysis of the results. The rest of the paper is organized as follows.
Section II briefly summarizes the history of SIMD architectures and the related work. We describe SIMD and some architectures in Section III. We provide a detailed description of our benchmarks in Section IV. The results of the experiments are presented in Section V. We finally conclude in Section VI.

II. RELATED WORK

Let us briefly discuss the history of SIMD architectures and the related work in the literature. The first use of SIMD instructions was in the early 1970s [1]. For example, they were used in the CDC STAR-100 and TI ASC machines, which could perform the same operation on a large set of data. This architecture became especially common when Cray Inc. used it in their supercomputers. However, the vector processing used in these machines is nowadays considered different from SIMD machines. Thinking Machines' CM-1 and CM-2, which are considered massively parallel processing-style supercomputers [2], started a new era in using SIMD for processing data in parallel. Current research, however, focuses on using SIMD instructions in desktop computers. Many of the tasks desktop computers perform these days, like video processing and real-time gaming, apply the same operation to large amounts of data, so companies tried to bring this architecture to desktops. As one of the earliest attempts, Sun Microsystems introduced SIMD integer instructions with the VIS (visual instruction set) extensions in the UltraSPARC I microprocessor, and MIPS introduced MDMX (MIPS Digital Media eXtension). Intel made SIMD widely used by introducing the MMX extensions to the x86 architecture. Motorola then introduced the AltiVec system in its PowerPCs, which was also used in IBM's POWER systems. Intel's response to this was the introduction of SSE. These days, SSE and its extensions are used more than the others.
Fig. 1. Processor array in a supercomputer. Taken from [5].
Fig. 2. The relationship between the processor array and the host computer. Taken from [5].
Fig. 3. Data processing with SISD vs. SIMD. Taken from [6].

There are various studies in the literature, conducted by research groups or companies, which focus on hardware. Holzer-Graf et al. [3] studied efficient vector implementations of AES-based designs. In that paper, three different vector implementations are analyzed and their performance is compared. The use of chip multiprocessing and the Cell Broadband Engine is described by Gschwind [4].

III. SIMD ARCHITECTURE

To understand the improvements being made on SIMD architectures, let us first start with the older SIMD architectures. Earlier versions of SIMD architectures were proposed for supercomputers [1], which have a number of processing elements and a control unit. In these machines, the processing elements (PEs) are the pieces which perform the computation, while the single control unit controls the array of PEs; it is generally responsible for reading the instructions, decoding them, and sending control signals to the PEs. Data are supplied to the PEs by a memory. In this architecture, the number of data paths from the memory to the PEs is equal to the number of PEs. These supercomputers also have specific interconnection networks, which provide flexible, high-performance data movement to and from the PEs. They also have an I/O system, which differs from one machine to another. Figure 1 illustrates processor array architectures in supercomputers, including the memory modules, interconnection network, PEs, control unit, and I/O system. In some supercomputers, the processing elements are controlled by a host computer, which is illustrated in Figure 2.
Supercomputers categorized as multiple instruction, multiple data (MIMD) in Flynn's taxonomy became popular, and this reduced the interest in SIMD machines for some period of time. Enhancements in desktop computers then led to powerful machines strong enough to handle applications such as video processing. Therefore, SIMD architectures again became popular in the 1990s.

SIMD architectures exploit a property of the data stream called data parallelism. SIMD computing is also known as vector processing, considering the rows of data coming to the processor as vectors of data. It is almost impossible to have applications which are purely parallel, so the pure use of SIMD computing is not possible. Hence, in applications of SIMD computing, the programs are written for single instruction, single data (SISD) machines and include SIMD instructions. The proportion of the sequential part and the SIMD part in the program determines the maximum speed-up, according to Amdahl's law. Figure 3 shows the data exploitation by a SIMD machine which processes 3 vectors at the same time, compared to an SISD machine which is able to process one row of data at a time. The length of the vectors in a SIMD processor determines the number of elements of a given data type. For instance, a 128-bit vector implementation in a processor allows us to do four-way single-precision floating-point operations.

We may categorize SIMD operations by their types. The first obvious operation type is intra-element arithmetic and non-arithmetic operations. Addition, multiplication, and subtraction are example arithmetic operations, while AND,
XOR, and OR are examples of non-arithmetic operations. Figure 4 illustrates the intra-element operations with two source vectors VA and VB and a destination vector VT. Each of these vectors contains four 32-bit elements. This means we can operate on two vectors, each holding four integers or single-precision floating-point values; similarly, we can process two vectors with two double-precision values inside each of them. The other operation type is inter-element arithmetic and non-arithmetic operations. These operations are between the elements of a single vector. Vector permutes and logical shifts are example inter-element operations (Figure 5).

Fig. 4. Intra-element arithmetic and non-arithmetic operations. Taken from [6].
Fig. 5. Inter-element operations between the elements of a vector. Taken from [6].
Fig. 6. AltiVec architecture with 4 distinct registers. Taken from [6].

To understand the modern SIMD architectures better, let us now briefly discuss some of them. We start with the AltiVec architecture, which is used by many companies including Apple and Motorola. The AltiVec architecture is illustrated in Figure 6. As one can see from this figure, there are four distinct sets of registers in AltiVec. Vectors VA and VB are the source vectors, while vector VT is the destination vector. The source registers hold the operands, while the destination registers hold the result. The additional vector VC is called the filter or modifier. This vector is useful for many operations, including vector permutations. In the vector permutation operation, the vector VC holds values that indicate, for each element of the vectors VA and VB, its new location in the destination vector VT. An illustration of this procedure can be seen in Figure 7. Intel introduced the MMX/SSE architecture, which is capable of SIMD operations on integers and floating-point values, with the new MMX series of processors. They used the Katmai New Instructions (KNI) for the new processors.
As shown in Figure 9, they added new 128-bit registers, called SSE registers. They had a floating-point unit alongside the SIMD units in their processors. They also had a switch mechanism, which allows the processor to change its mode from SISD to SIMD. Figure 8 shows the architecture of a 128-bit Intel MMX/SSE implementation. The 128-bit processor is considered a combination of two 64-bit processing units, which enables these processors to do two different operations at the same time. Nowadays, Intel's 256-bit SIMD architecture is available.

Fig. 7. Vector permutation with AltiVec architecture. Taken from [6].

Here, we finish our discussion of SIMD architectures, having given significant architecture examples, and come to the next level in the study. The main idea of this study is to use these architectures with convenient benchmarks to evaluate their performance improvements for various programs. Therefore, the benchmarks used in the study are described in detail in the next section.

IV. MATRIX MULTIPLICATION AND AES

There is no doubt that future processors will differ significantly from current designs and will reshape the way we think about programming. SIMD is one of the most important advancements of modern CPUs. This research deals with SIMD and how it can be used to increase the performance of programs. The main goal of our research is to use the SIMD instruction set to improve program performance. The first problem we encountered is that there are no SIMD-implemented benchmarks for assessing the performance, so we created our own tools to assess the improvement in performance. In the next part
we will briefly describe the types of vector operations that were available for the implementation [7]:

V <- V          Example: complement all elements
S <- V          Examples: min, max, sum
V <- V x V      Examples: vector addition, multiplication, division
V <- V x S      Examples: multiply or add a scalar to a vector
S <- V x V      Example: calculate an element of a matrix

Fig. 8. Intel MMX/SSE architecture. Taken from [6].
Fig. 9. 128-bit Intel MMX/SSE architecture with SSE registers. Taken from [6].
Fig. 10. Ordinary implementation of matrix multiplication.
Fig. 11. Matrix multiplication with just 3 SIMD instructions.

We chose matrix multiplication and AES for the benchmarks, for two reasons. First, they are used in many applications in desktop computers and also in mobile devices these days. Second, they can be optimized with SIMD instructions because they perform the same instructions on large amounts of data. We now briefly describe the benchmarks.

A. Matrix multiplication

Matrix multiplication is one of the best candidates to be optimized by SIMD instructions, because it deals with matrices, which consist of arrays. Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra. It forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. Figure 10 shows a naive implementation of matrix multiplication with many scalar instructions. By applying SIMD instructions, the number of executed instructions decreases, and we achieve a noticeable performance improvement. Figure 11 shows the listing of the code with the SIMD instruction set.

B. AES encryption algorithm

AES is the other candidate for implementation with SIMD instructions. It has vast applications everywhere in computer systems, from mobile devices to distributed data centers.
AES is a block cipher algorithm that was selected by the U.S. National Institute of Standards and Technology (NIST). It was selected by contest from a list of five finalists, which were themselves selected from an original list of more than 15 submissions. AES will begin to supplant the Data Encryption Standard (DES), and the later Triple DES, over the next few years in many cryptography applications. The algorithm was designed by two Belgian cryptologists, Vincent Rijmen and Joan Daemen [8], whose surnames are reflected in the cipher's name. The cipher has a variable block length and key length. The authors specify how to use keys with a length of 128, 192, or 256 bits to encrypt blocks with a length of 128, 192, or 256 bits (all nine combinations of key length and block length are possible); both the block length and the key length can be extended very easily to multiples of 32 bits. Figure 12 illustrates the flow chart of the algorithm that we use. AES can be implemented very efficiently on a wide range of processors, and we have implemented it using SIMD instructions.
Fig. 12. Flow chart for the AES algorithm.
Fig. 13. Performance gains using streaming SIMD extensions. Taken from [9].
Fig. 14. Time to do 512x512 multiplication in milliseconds.
Fig. 15. AES encryption (cycles/byte).

V. EXPERIMENTAL STUDY

A. Implementation

Basically, there are two ways to implement an algorithm using SIMD instructions. First, it is possible to implement the algorithm using normal instructions and use a compiler which automatically converts the instructions to SIMD instructions wherever possible. Second, it is possible to optimize the algorithm by writing the SIMD instructions manually. For the first method, Microsoft Visual Studio is not able to do this task at this time, but Intel has a compiler which is capable of doing it. However, the most optimized version can be achieved by manually implementing an algorithm and choosing the most appropriate instructions for the implementation. So, we chose the second method, which gives the best result.

We implemented the applications in C++ and used Microsoft Visual Studio to compile them. We disabled all the compiler optimization features to make sure that any improvement comes only from using SIMD instructions, not from compiler optimizations like loop unrolling. To obtain an acceptable estimate of the number of cycles the algorithms need, we closed all unnecessary processes and ran each implemented function a large number of times. We then divided the total number of cycles by the number of executions of the function, giving the average number of cycles needed to execute the algorithm.

B. Results

The first benchmark was matrix multiplication. Figure 13 shows the results of its implementation by Intel. Most implementations of matrix multiplication do the task on fixed-dimension matrices. This makes the algorithm faster, because they do not use loops, which have their own overhead and cannot be written using SIMD instructions.
However, our implementation of matrix multiplication can do this task on variable matrix dimensions. Considering this point, our optimization is still comparable to the Intel implementation. Figure 14 compares our SIMD implementation of matrix multiplication with the original naive algorithm; it is about 2 times faster. Figure 15 shows the result of our optimization of AES encryption, which is about 23 times faster.

We now give some explanation of these results. First of all, wherever we replace a normal instruction with a SIMD one, we decrease the number of instructions by a factor of four. However, not all instructions are replaceable by SIMD instructions. In addition, on average, the number of cycles needed to execute a SIMD instruction is higher than for the normal instructions doing the same task. But, considering the Iron Law, the decrease in the number of instructions affects the final performance more than the increase in the average number of cycles per instruction. So, successful optimization of an algorithm depends
on two factors: first, how many instructions are replaceable by SIMD instructions; second, what combination of instructions is used in the SIMD implementation. The second factor is important because different SIMD instructions need different numbers of cycles to execute.

VI. CONCLUSION

In this paper, we studied SIMD architectures. We summarized the history of these architectures and explained some of the new ones. We proposed the use of SIMD for two applications, matrix multiplication and AES. The operations performed for matrix multiplication and the AES algorithm were described in detail. The experimental results showed that the use of SIMD significantly improves the performance of each of the programs.

REFERENCES

[1] Wikipedia, "SIMD," 2013. [Online; accessed 15-April-2013].
[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface.
[3] S. Holzer-Graf et al., "Efficient vector implementations of AES-based designs: a case study and new implementations for Grøstl," Topics in Cryptology, CT-RSA.
[4] M. Gschwind, "Chip multiprocessing and the Cell Broadband Engine," Computing Frontiers.
[5] H. W. Lawson and B. Svensson, Parallel Processing in Industrial Real-Time Applications. Prentice Hall.
[6] J. Stokes, "SIMD architectures," Ars Technica.
[7] Intel Corp., Intel Advanced Vector Extensions Programming Reference.
[8] J. Daemen and V. Rijmen, The Design of Rijndael: AES, the Advanced Encryption Standard. Springer.
[9] Intel Corp., "Streaming SIMD Extensions: Matrix Multiplication," Application Note AP-930, June 1999.
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationChapter 2 Logic Gates and Introduction to Computer Architecture
Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are
More informationAn Introduction to Parallel Architectures
An Introduction to Parallel Architectures Andrea Marongiu a.marongiu@unibo.it Impact of Parallel Architectures From cell phones to supercomputers In regular CPUs as well as GPUs Parallel HW Processing
More informationQuiz for Chapter 1 Computer Abstractions and Technology
Date: Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
More informationEmbedded Systems Architecture. Computer Architectures
Embedded Systems Architecture Computer Architectures M. Eng. Mariusz Rudnicki 1/18 A taxonomy of computer architectures There are many different types of architectures, and it is worth considering some
More informationFundamentals of Computer Design
CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining
More informationLecture Topics. Announcements. Today: The MIPS ISA (P&H ) Next: continued. Milestone #1 (due 1/26) Milestone #2 (due 2/2)
Lecture Topics Today: The MIPS ISA (P&H 2.1-2.14) Next: continued 1 Announcements Milestone #1 (due 1/26) Milestone #2 (due 2/2) Milestone #3 (due 2/9) 2 1 Evolution of Computing Machinery To understand
More informationParallel Processing SIMD, Vector and GPU s
Parallel Processing SIMD, ector and GPU s EECS4201 Comp. Architecture Fall 2017 York University 1 Introduction ector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationEE382 Processor Design. Concurrent Processors
EE382 Processor Design Winter 1998-99 Chapter 7 and Green Book Lectures Concurrent Processors, including SIMD and Vector Processors Slide 1 Concurrent Processors Vector processors SIMD and small clustered
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More information3.3 Hardware Parallel processing
Parallel processing is the simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more CPUs running it. In practice, it is
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationCISC Attributes. E.g. Pentium is considered a modern CISC processor
What is CISC? CISC means Complex Instruction Set Computer chips that are easy to program and which make efficient use of memory. Since the earliest machines were programmed in assembly language and memory
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationReview of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism.
CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 1 Review of Last Lecture Amdahl s Law limits benefits of parallelization Request Level Parallelism
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationCS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions
CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Instructor: Justin Hsia 3/08/2013 Spring 2013 Lecture #19 1 Review of Last Lecture Amdahl s Law limits benefits
More informationStructure of Computer Systems
Structure of Computer Systems Structure of Computer Systems Baruch Zoltan Francisc Technical University of Cluj-Napoca Computer Science Department U. T. PRES Cluj-Napoca, 2002 CONTENTS PREFACE... xiii
More informationDesign of an Efficient Architecture for Advanced Encryption Standard Algorithm Using Systolic Structures
Design of an Efficient Architecture for Advanced Encryption Standard Algorithm Using Systolic Structures 1 Suresh Sharma, 2 T S B Sudarshan 1 Student, Computer Science & Engineering, IIT, Khragpur 2 Assistant
More informationPrinciples of Computer Architecture. Chapter 10: Trends in Computer. Principles of Computer Architecture by M. Murdocca and V.
10-1 Principles of Computer Architecture Miles Murdocca and Vincent Heuring Chapter 10: Trends in Computer Architecture 10-2 Chapter Contents 10.1 Quantitative Analyses of Program Execution 10.2 From CISC
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationIntroduction to Parallel Processing
Babylon University College of Information Technology Software Department Introduction to Parallel Processing By Single processor supercomputers have achieved great speeds and have been pushing hardware
More informationComputer Architecture
Computer Architecture Computer Architecture Hardware INFO 2603 Platform Technologies Week 1: 04-Sept-2018 Computer architecture refers to the overall design of the physical parts of a computer. It examines:
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationParallel Numerics, WT 2013/ Introduction
Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationFrom CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations
1 CISC Creates the Anti CISC Revolution Digital Equipment Company (DEC) introduces VAX (1977) Commercially successful 32-bit CISC minicomputer From CISC to RISC In 1970s and 1980s CISC minicomputers became
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationImplementation of the block cipher Rijndael using Altera FPGA
Regular paper Implementation of the block cipher Rijndael using Altera FPGA Piotr Mroczkowski Abstract A short description of the block cipher Rijndael is presented. Hardware implementation by means of
More informationFLYNN S TAXONOMY OF COMPUTER ARCHITECTURE
FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn s classification scheme is based on the notion of a stream of information.
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More information10th August Part One: Introduction to Parallel Computing
Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More information3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:
BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General
More informationRAID 0 (non-redundant) RAID Types 4/25/2011
Exam 3 Review COMP375 Topics I/O controllers chapter 7 Disk performance section 6.3-6.4 RAID section 6.2 Pipelining section 12.4 Superscalar chapter 14 RISC chapter 13 Parallel Processors chapter 18 Security
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationLecture 8: RISC & Parallel Computers. Parallel computers
Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer
More informationAdvanced Encryption Standard and Modes of Operation. Foundations of Cryptography - AES pp. 1 / 50
Advanced Encryption Standard and Modes of Operation Foundations of Cryptography - AES pp. 1 / 50 AES Advanced Encryption Standard (AES) is a symmetric cryptographic algorithm AES has been originally requested
More informationBlock Ciphers. Lucifer, DES, RC5, AES. CS 470 Introduction to Applied Cryptography. Ali Aydın Selçuk. CS470, A.A.Selçuk Block Ciphers 1
Block Ciphers Lucifer, DES, RC5, AES CS 470 Introduction to Applied Cryptography Ali Aydın Selçuk CS470, A.A.Selçuk Block Ciphers 1 ... Block Ciphers & S-P Networks Block Ciphers: Substitution ciphers
More informationCryptography and Network Security
Cryptography and Network Security Spring 2012 http://users.abo.fi/ipetre/crypto/ Lecture 6: Advanced Encryption Standard (AES) Ion Petre Department of IT, Åbo Akademi University 1 Origin of AES 1999: NIST
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationPerformance, Power, Die Yield. CS301 Prof Szajda
Performance, Power, Die Yield CS301 Prof Szajda Administrative HW #1 assigned w Due Wednesday, 9/3 at 5:00 pm Performance Metrics (How do we compare two machines?) What to Measure? Which airplane has the
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationFigure 1-1. A multilevel machine.
1 INTRODUCTION 1 Level n Level 3 Level 2 Level 1 Virtual machine Mn, with machine language Ln Virtual machine M3, with machine language L3 Virtual machine M2, with machine language L2 Virtual machine M1,
More information