Vector Processors

A vector processor is a pipelined processor with special instructions designed to keep the (floating-point) execution-unit pipeline(s) full. These special instructions are vector instructions.

Terminology:

Scalar: a single quantity (number).
Vector: an ordered series of scalar quantities; a one-dimensional array.

[Figure: a scalar quantity shown as a single data item; a vector quantity shown as a row of eight data items.]

Five basic types of vector operations:

1. V ← V. Example: Complement all elements
2. S ← V. Examples: Min, Max, Sum
3. V ← V x V. Examples: Vector addition, multiplication, division
4. V ← V x S. Examples: Multiply or add a scalar to a vector
5. S ← V x V. Example: Calculate an element of a matrix

One instruction says, in effect, "do the same thing on all the elements of the vector(s)."

Vector Processors, Architecture of Parallel Computers
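The five operation types can be illustrated with a small Python sketch. This is only a software model of what each type computes, not of how the hardware pipelines it; the function names are made up for this illustration.

```python
# Software model of the five basic vector operation types.
# Function names are illustrative, not real vector instructions.

def complement(v):
    """Type 1, V <- V: bitwise complement of every element."""
    return [~x for x in v]

def vector_sum(v):
    """Type 2, S <- V: reduce a vector to one scalar."""
    return sum(v)

def vector_add(a, b):
    """Type 3, V <- V x V: element-by-element addition."""
    return [x + y for x, y in zip(a, b)]

def scale(v, s):
    """Type 4, V <- V x S: multiply every element by a scalar."""
    return [x * s for x in v]

def dot(a, b):
    """Type 5, S <- V x V: inner product, one element of a matrix product."""
    return sum(x * y for x, y in zip(a, b))
```

In each case a single call stands in for a single vector instruction that is applied to all elements.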

The generic vector processor:

[Figure: a multiport memory system supplies operand streams A and B to a pipelined processor, which returns the result stream C = A x B to memory.]

Many large-scale scientific and engineering problems can be solved by operations on large vectors or matrices of floating-point numbers. Vector processors are designed to work efficiently on these problems. Performance of these machines is measured in:

FLOPS: floating-point operations per second,
MegaFLOPS: a million FLOPS, or
GigaFLOPS: a billion FLOPS.

The extremely high performance is achieved only for problems that can be expressed as operations on large vectors.

These processors are also called supercomputers, popularized by the CRAY series. The cost/performance ratio of vector processors can be impressive, but the initial cost is high (few of them are built). NEC's SX-4 series, which NEC claims was the most successful supercomputer, sold just 134 systems in 3 years. NEC reports that the SX-5, introduced in June 1998, has received orders for 22 systems over the last year.

We also see the attached vector processor: an optional vector processing unit attached to a standard scalar computer.

1997, 1999 E.F. Gehringer, G.Q. Kenney, CSC 506, Summer 1999

Matrix multiplication:

Suppose we want to calculate the product of two N x N matrices, C := A x B. We must perform this calculation:

$$c_{ij} := \sum_{k=0}^{N-1} a_{ik} \, b_{kj}$$

Inner loop of a scalar processor performing the matrix multiply

The following loop calculates a single element of the matrix C. We must execute this loop $N^2$ times to get A x B:

            ; Instructions to initialize 1 iteration of kloop
            ; (initialize RC, RN, Rk, Ri, Rj)
    kloop   ADD    Ri, Stride-i    ; Increment column of A
            ADD    Rj, Stride-j    ; Increment row of B
            LOAD   RA, A(Ri)       ; Get value of Matrix A row
            LOAD   RB, B(Rj)       ; Get value of Matrix B column
            FMPY   RA, RB          ; Floating multiply
            FADD   RC, RA          ; Floating add
            INC    Rk              ; Increment k
            CMP    Rk, RN          ; At end of Row x Column?
            BNE    kloop           ; No -- Repeat for R x C
            STORE  RC, C(r, c)     ; Yes -- Store C element
            ; Continue with all Rows/Columns of C
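The scalar loop can be sketched in Python as follows. It assumes the matrices are stored row-major in flat lists (an assumption made for this illustration), so that walking a row of A uses stride 1 and walking a column of B uses stride N. Unlike the assembly listing, the sketch uses each index before incrementing it.

```python
def matmul_flat(a, b, n):
    """Multiply two n x n matrices stored row-major in flat lists.

    The innermost loop mirrors kloop: one multiply and one add per
    iteration, plus index updates with stride 1 (A) and stride n (B).
    """
    c = [0.0] * (n * n)
    for r in range(n):
        for col in range(n):
            ri = r * n      # first element of row r of A
            rj = col        # first element of column col of B
            rc = 0.0        # accumulator, plays the role of RC
            for _ in range(n):          # kloop
                rc += a[ri] * b[rj]     # FMPY followed by FADD
                ri += 1                 # stride-i = 1
                rj += n                 # stride-j = n
            c[r * n + col] = rc         # STORE RC, C(r, c)
    return c
```

The kloop body runs N times for each of the N x N elements of C, which is where the $N^2$ repetitions of the loop come from.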

Vector Processor Operation

With a vector processor, we have minimal instructions to set up the vector operation, and the entire inner loop (kloop) consists of three vector instructions:

    ; Instructions to initialize vector operation
    VLOAD    V1, A(r), N, Stride-i    ; Vector load row of A with stride i
    VLOAD    V2, B(c), N, Stride-j    ; Vector load column of B with stride j
    VMPYADD  V1, V2, RC               ; Vector multiply + add to C
    STORE    RC, C(r, c)              ; Store C element
    ; Continue with all Rows/Columns of C

The special vector instruction allows us to calculate each element of C in a single vector floating-point instruction (VMPYADD) rather than 2N scalar floating-point instructions (FMPY and FADD) and 5N loop-control and addressing instructions.

In addition, the special vector instruction can keep the floating-point pipeline full and generate one result output per clock. For example, if we have a 4-stage floating-point addition pipe and a 10-stage floating-point multiply pipe:

Do we ever get more than one instruction in the pipelines at a time with the kloop sequence of the scalar processor?

We will keep both pipelines full with successive multiply/adds on the vector processor.

With P independent pipes, we can operate on P elements of C in parallel.
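To make the instruction-count comparison concrete, here is a small Python sketch. It counts instructions issued per element of C, using the nine instructions in the scalar kloop body (2 floating-point, 2 loads, 5 loop-control and addressing) plus the final STORE, against the four instructions of the vector sequence. Initialization instructions are left out of both counts, which is a simplification made here.

```python
def scalar_instructions_per_element(n):
    """kloop body: 2 FP + 2 loads + 5 control/addressing, n times, plus STORE."""
    return (2 + 2 + 5) * n + 1

def vector_instructions_per_element(n):
    """VLOAD, VLOAD, VMPYADD, STORE: independent of the vector length n."""
    return 4

def issue_ratio(n):
    """How many times more instructions the scalar version issues."""
    return scalar_instructions_per_element(n) / vector_instructions_per_element(n)
```

For N = 100 the scalar version issues 901 instructions per element against 4. This counts instructions issued, not clock cycles: each vector instruction still has to process N elements.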

FORTRAN is still the preferred language for the majority of the users of vector processors, because the majority of users are scientists and engineers and because there is a large amount of scientific software available in FORTRAN.

Example FORTRAN:

        DO 100 I=1,N
          A(I) = B(I) + C(I)
          B(I) = 2 * A(I+1)
    100 CONTINUE

If we unwind this DO loop:

    A(1) = B(1) + C(1)
    B(1) = 2 * A(2)
    A(2) = B(2) + C(2)
    B(2) = 2 * A(3)
    ...

Each B(I) uses the value A(I+1) had before the loop overwrote it, so the vector version must first save the old values of A(2:N+1) in a temporary.

Vector FORTRAN:

    TEMP(1:N) = A(2:N+1)
    A(1:N) = B(1:N) + C(1:N)
    B(1:N) = 2 * TEMP(1:N)

Also, some optimizing FORTRAN compilers automatically generate vector code from the original DO loop. For example, DEC VAX FORTRAN supports the automatic generation of vector operations:

[NO]VECTOR: Controls whether or not the compiler checks the source code for data dependencies and generates code for the vector hardware when the code is eligible.
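The need for TEMP can be checked with a short Python sketch. It models the arrays as 0-based lists (A has N+1 elements, B and C have N), compares the original loop with the vector form, and shows that a naive vector form without the temporary gives a different B. The function names are illustrative only.

```python
def scalar_loop(a, b, c, n):
    """The original DO loop, executed one iteration at a time."""
    a, b = a[:], b[:]
    for i in range(n):
        a[i] = b[i] + c[i]
        b[i] = 2 * a[i + 1]      # a[i + 1] has not been overwritten yet
    return a, b

def vector_form(a, b, c, n):
    """Vector version: save the old A(2:N+1) before A is overwritten."""
    temp = a[1:n + 1]
    new_a = [b[i] + c[i] for i in range(n)] + a[n:]
    new_b = [2 * t for t in temp] + b[n:]
    return new_a, new_b

def naive_vector_form(a, b, c, n):
    """Incorrect version: reads A after the whole vector was updated."""
    new_a = [b[i] + c[i] for i in range(n)] + a[n:]
    new_b = [2 * new_a[i + 1] for i in range(n)] + b[n:]
    return new_a, new_b
```

With A = [1, 2, 3, 4], B = [10, 20, 30], C = [1, 1, 1] and N = 3, the loop and the TEMP version both give B = [4, 6, 8], while the naive version gives B = [42, 62, 8]. This is the kind of data dependency a vectorizing compiler has to detect.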

An example vector processor:

NEC announced the SX-4 supercomputer in November [...]. It is the third in the SX series of supercomputers and is upward compatible from the SX-3R vector processor, with enhancements for scalar processing, short vector processing, and parallel processing.

The SX-4 has an 8.0 ns clock cycle and a peak performance of 2 GFLOPS per processor. Each SX-4 processor contains a vector unit and a superscalar unit. The vector unit is built using eight vector pipeline processor VLSI chips. Each vector unit chip is a self-contained vector unit with registers holding 32 vector elements. The eight chips are connected by crossbar and comprise 32 vector pipelines arranged as sets of eight add/shift, eight multiply, eight divide, and eight logical pipes. Each set of eight pipes serves a single vector instruction, and all sets of pipes can operate concurrently. With a vector add and vector multiply operating concurrently, the pipes provide 2 GFLOPS peak performance.

The memory and the processors within each SX-4 node are connected by a nonblocking crossbar. Each processor has a 16 Gbytes per second port into the crossbar. The main memory can have up to 1024 banks of 64-bit wide synchronous static RAM (SSRAM). The SSRAM is composed of 4 Mbit, 15 ns components. Bank cycle time is only two clocks. (Note: NEC has subsequently changed to use synchronous dynamic RAM (SDRAM) instead of static RAM.)

A 32-processor node has a sustainable memory bandwidth of 512 gigabytes per second. Conflict-free unit-stride as well as stride-2 access is guaranteed from all 32 processors simultaneously. Higher strides and list vector access benefit from the very short bank cycle time.

Note: The SX-4 achieves the stated 2 GFLOPS by feeding a multiply directly into an add, and concurrently doing this on 8 parallel pipelines. 8 ns per clock = 125 MHz. 125 MHz x 2 FLOPS/clock x 8 pipes = 2 GFLOPS.
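The arithmetic in the note can be reproduced with a few lines of Python. The helper names are made up for this sketch; the inputs are the figures quoted above (8.0 ns clock, 2 FLOPS per clock per pipe, 8 pipes, 16 GB/s per processor port, 32 processors per node).

```python
def clock_mhz(clock_ns):
    """Clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000.0 / clock_ns

def peak_gflops(clock_ns, flops_per_clock, pipes):
    """Peak rate: frequency x FLOPS per clock per pipe x number of pipes."""
    return clock_mhz(clock_ns) * flops_per_clock * pipes / 1000.0

def node_bandwidth_gb_per_s(processors, port_gb_per_s):
    """Aggregate memory bandwidth if every processor port runs at full rate."""
    return processors * port_gb_per_s
```

With the SX-4 figures, `peak_gflops(8.0, 2, 8)` gives 2.0 and `node_bandwidth_gb_per_s(32, 16)` gives 512, matching the numbers quoted in the text.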

NEC SX-5 Organization

[Figure: SX-5 organization diagram. Legend:]

CPU: Central Processing Unit
MM: Main Memory Unit
IOP: Input-Output Processor
VR: Vector Register File
SR: Scalar Register File

The SX-5 Series employs 0.25 µm CMOS LSI technology. This enables the SX-5 to achieve a clock cycle of 4.0 ns, which is half that of the SX-4 Series.

                       SX-4BA Server    SX-4A Single Node    SX-4AM Multi Node
    CPUs               [...]            [...]                [...]
    CPU Peak           1.8 GF           2 GF                 2 GF
    System Peak        7.2 GF           32 GF                512 GF
    Clock              8.8 ns           8.0 ns               8.0 ns
    Memory
      Type             SDRAM            SDRAM                SDRAM
      Max. Capacity    16 GB            32 GB                512 GB
      Max Banking      4,096            8,[...]              [...],072
    IOP (max)          1.6 GB/s         3.2 GB/s             25.6 GB/s
    XMU                Optional         Optional             Optional
      Max Bandwidth    3.6 GB/s         8 GB/s               128 GB/s
      Max. Capacity    8 GB             16 GB                64 GB

    Table 1: SX-4A Models Overview

SX-4 Vector Unit

Substantial effort has been made to provide significant vector performance for short vector lengths. The crossover between scalar and vector performance is a short 8 elements in most cases.

The vector unit has 8 operational registers from which all operations can be started. In addition, there are 64 vector data registers which have a subset of valid instructions and that can receive results from pipelines concurrently with the 8 operational registers; the vector data registers serve as a high-performance vector cache which significantly reduces memory traffic in most cases.

The ganging of 8 vector pipeline processor VLSI chips results in visible vector registers which each hold 256 vector elements. Therefore the vector unit is described as 72 registers of 256 elements of 64 bits each.

Revisit the definition of speedup

Recall that the speedup of a pipeline measures how much more quickly a workload is completed by the pipeline processor than by a non-pipeline processor.

$$\text{Speedup} = \frac{\text{Best Serial Time}}{\text{Parallel Execution Time}}$$

A k-stage pipeline with all stages of equal duration (one clock period) has a theoretical speedup of k because it takes k clocks to get a single operation through the pipe and we are retiring one operation every clock.

We will now look at the actual speedup of a pipeline in a vector processor, considering how full we can keep it.

Several tasks (operations on the elements of a vector) may be simultaneously active in a pipeline.

    Space (pipeline stages)
    S4 |             T1  T2  T3  T4  T5
    S3 |         T1  T2  T3  T4  T5
    S2 |     T1  T2  T3  T4  T5
    S1 | T1  T2  T3  T4  T5
       +---------------------------------
         1   2   3   4   5   6   7   8     Time (pipeline cycles)

Suppose there are:

k stages in the pipeline, and
n tasks to be executed.

Executed serially, the n tasks take nk clocks. In the pipeline, the first task needs k clocks to pass through all the stages, and each of the remaining n − 1 tasks completes one clock later, so the pipeline is not full during startup at the beginning and while it empties out at the end. The speedup S(k) that is achieved when we account for the time it takes to fill the pipeline is therefore:

$$S(k) = \frac{nk}{k + (n - 1)}$$

As n (number of tasks) approaches infinity, the speedup approaches k (number of stages).

Therefore, short vectors get little speedup and long vectors approach maximum speedup.
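A short Python sketch makes the effect of vector length visible. It simply evaluates the ratio of serial time (n tasks of k clocks each) to pipelined time (k clocks for the first task plus one clock for each of the other n − 1); the function name is chosen for this illustration.

```python
def pipeline_speedup(k, n):
    """Speedup of a k-stage pipeline on n tasks: n*k / (k + (n - 1))."""
    serial_clocks = n * k
    pipelined_clocks = k + (n - 1)
    return serial_clocks / pipelined_clocks
```

For the 4-stage, 5-task case drawn in the diagram, the pipeline needs 8 clocks instead of 20, a speedup of 2.5. A single task (n = 1) gets no speedup at all, and for very long vectors the value approaches k = 4.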

It may be possible to partially overlap finishing one vector operation with starting up another vector operation. A timing diagram would look like this:

[Figure: space-time diagram of two overlapped vector operations.]

Vector instructions must be able to specify the stride for a vector. The elements of a vector may not be stored in consecutive memory locations. For example, in our N x N matrix multiplication, vector A has a stride of 1 (the row) and vector B has a stride of N (the column). A constant stride may be specified so that, for example, every other (stride = 2) or every third (stride = 3) vector element is loaded or stored.

Many problems involve sparse matrices where the stride is irregular. In such cases, gather/scatter instructions are used to load and store data under the control of a vector register that contains pointers to the locations of the needed data (indirect addressing).

An arithmetic operation need not be performed on every element of the vector. In such a case, a mask register is constructed that controls which elements of a vector are loaded, operated on, and stored.

Assuming that we get all of the pipeline and logical operations worked out, the main problem with vector processors is feeding them. How much memory bandwidth do we need to feed an SX-4 processor with 64-bit operands?
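Strided access and gather/scatter can be modeled in Python as below. Memory is represented as a flat list and a vector register as a Python list; the function names are illustrative and do not correspond to instructions of any particular machine.

```python
def strided_load(memory, base, n, stride):
    """Load n elements starting at base, stepping by a constant stride."""
    return [memory[base + i * stride] for i in range(n)]

def gather(memory, index_vector):
    """Load the elements whose addresses are held in an index vector."""
    return [memory[i] for i in index_vector]

def scatter(memory, index_vector, values):
    """Store values to the addresses held in an index vector (in place)."""
    for i, v in zip(index_vector, values):
        memory[i] = v
```

For a 4 x 4 matrix stored row-major in `memory = list(range(16))`, `strided_load(memory, 4, 4, 1)` returns row 1 and `strided_load(memory, 2, 4, 4)` returns column 2, which is the stride-1 versus stride-N distinction from the matrix multiplication example.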

If we had to feed the pipeline directly from interleaved memory, as Stone shows in figure 5.4:

[Figure: timing diagram after Stone, figure 5.4. Horizontal axis: time (clock periods). Rows from top to bottom: pipeline stages 4 to 1, then memory modules 7 to 0. Each operand read (RA, RB) and each result write (W) occupies its module for two clock periods. Accesses visible for each module, in time order:]

    Mem. mod. 7:  RB5, RA7, W3
    Mem. mod. 6:  RB4, RA6, W2
    Mem. mod. 5:  RB3, RA5, W1
    Mem. mod. 4:  RB2, RA4, W0
    Mem. mod. 3:  RB1, RA3
    Mem. mod. 2:  RB0, RA2, W6
    Mem. mod. 1:  RA1, RB7, W5
    Mem. mod. 0:  RA0, RB6, W4

The pipeline is running at 8 ns per clock and each operand is given two clocks, so the memory modules must each have an access time of no more than 16 ns. This is a reasonable SRAM access time.

Problems:

Three of these modules need to transfer their 64-bit data words concurrently to/from the processor pipeline on every clock, requiring three 125 MHz buses into the processor, similar to figure 5.2 in Stone.

The three vectors must be stored in the modules as in figure 5.3 such that the access to the memory modules is perfectly synchronized.
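As a rough check on the bandwidth this implies, the following Python sketch multiplies out the figures above: three 64-bit streams (two operand reads and one result write) per clock at 8 ns per clock. The `pipes` parameter is an assumption added here to show naive scaling to several pipelines; it is not a statement about how the SX-4 actually moves its data.

```python
def required_bandwidth_gb_per_s(clock_ns, streams=3, word_bytes=8, pipes=1):
    """Bytes moved per nanosecond, which is numerically equal to GB/s."""
    return pipes * streams * word_bytes / clock_ns

def module_access_time_ns(clock_ns, clocks_per_access=2):
    """Longest access time a module may have if each access gets two clocks."""
    return clock_ns * clocks_per_access
```

One pipeline fed this way needs 3 GB/s. Scaling naively to eight pipelines would need 24 GB/s, more than the 16 GB/s processor port quoted earlier, which suggests why operands are better kept close to the pipelines than streamed from memory on every clock.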

Back to Interleaved Memory

How can we organize memory to provide sequential access faster than any one module cycle time?

Recall that interleaved memory places consecutive words of memory in different memory modules:

    Memory Module 0:  words with addresses = 0 (mod 4)
    Memory Module 1:  words with addresses = 1 (mod 4)
    Memory Module 2:  words with addresses = 2 (mod 4)
    Memory Module 3:  words with addresses = 3 (mod 4)

Since a read or write to one module can be started before a read/write to another module finishes, reads/writes can be overlapped.

Only the leading bits of the address are used to determine the address within the module. The least-significant bits (in the diagram above, the two least-significant bits) determine the memory module.

Thus, by loading a single address into the memory-address register (MAR) and saying "read" or "write", the processor can read/write M words of memory. We say that memory is M-way interleaved.

Low-order interleaving distributes the addresses so that consecutive addresses are located within consecutive modules. For example, for 8-way interleaving:

[Figure: address layout for 8-way low-order interleaving.]
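The interaction between stride and interleaving can be explored with a small Python sketch. It assumes low-order interleaving as described above (module number = address mod M) and uses the fact that a constant-stride sequence visits M / gcd(stride, M) distinct modules. The function names are made up for this illustration.

```python
from math import gcd

def modules_touched(base, n, stride, num_modules):
    """Module number hit by each of n accesses with a constant stride."""
    return [(base + i * stride) % num_modules for i in range(n)]

def distinct_modules(stride, num_modules):
    """Number of different modules a long strided access stream uses."""
    return num_modules // gcd(stride, num_modules)
```

With 4 modules, stride 1 cycles through modules 0, 1, 2, 3 and all accesses can be overlapped, while stride 4 hits module 0 every time and gains nothing from interleaving. With 8 modules, stride 2 uses only 4 of them.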

Interleaved-memory designs:

Interleaved memory divides an address into two portions: one selects the module, and the other selects an address within the module.

Each module has a separate MAR and a separate MDR.

When an address is presented, a decoder determines which MAR should be loaded with this address. It uses the low-order $m = \log_2 M$ bits to decide this.

The high-order $n - m$ bits are actually loaded into the MAR. They select the proper location within the module.

[Figure: the n-bit address from the CPU is split into the address within the memory module (high-order n − m bits) and the memory module number (low-order m bits). The n − m bits travel over the address bus to the MARs of memory units 0 through 2^m − 1; a decoder uses the m bits to select the unit; each unit's MDR connects to the data bus.]
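The address split can be written out in a few lines of Python. This is a sketch of the decoding rule only (low-order m bits select the module, the remaining high-order bits go to that module's MAR); the function name is chosen for illustration.

```python
def split_address(addr, m):
    """Split an address for 2**m-way low-order interleaving.

    Returns (module, offset): the low-order m bits select the module,
    and the high-order bits are the address within that module.
    """
    module = addr & ((1 << m) - 1)   # low-order m bits
    offset = addr >> m               # high-order n - m bits
    return module, offset
```

For 4-way interleaving (m = 2), address 13 (binary 1101) maps to module 1 at offset 3, and addresses 12, 13, 14, 15 fall in modules 0, 1, 2, 3 at the same offset.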

An alternative to feeding a vector processor directly from external storage is to provide a hierarchical memory system similar to cache memory. Memory on the processor chip is called register storage rather than L1 cache, and is managed directly by the programmer rather than automatically by the hardware.

A vector processor with high-speed register storage:

[Figure: main memory connects through a vector load/store unit to the vector registers and scalar registers, which feed the functional units: FP add/subtract, FP multiply, FP divide, integer, and Boolean.]

The vector registers are large: 64 to 256 floating-point numbers each.

256 floating-point numbers at 64 bits each times 8 registers is equivalent to a 16 KB internal data cache.
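The capacity figure is easy to verify with a one-line calculation, sketched here in Python with an illustrative helper name.

```python
def register_file_bytes(registers, elements_per_register, bits_per_element):
    """Total storage of a vector register file, in bytes."""
    return registers * elements_per_register * bits_per_element // 8
```

Eight registers of 256 64-bit elements give 16,384 bytes, the 16 KB quoted above. Applying the same formula to the SX-4 description of 72 registers of 256 elements of 64 bits each gives 147,456 bytes (144 KB).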

Masking

If statements in loops get in the way of vector processors. For example, consider an operation on a vector where you want to do something if the element is not 0. You might code it as the following loop for a scalar processor:

    for i := 1 to n do
      if A[i] <> 0 then A[i] := A[i] - B[i];

This does not work well with a vector processor. We would like to specify an operation on the entire vector A.

A vector mask register (VM) holds a Boolean vector that can be set to specify whether the operation on the corresponding vector element should be performed. The operation on the vector element takes place only if the corresponding mask bit in the VM is 1.

For example, the following sequence could be used with the mask register:

    VLOAD  V1, A, N, Stride-i    ; Vector load row of A with stride i
    VLOAD  V2, B, N, Stride-j    ; Vector load column of B with stride j
    SLOAD  S0, #0                ; Scalar floating point constant 0
    VMSNE  S0, V1                ; Sets VM bit to 0 if V1[i] = S0
    VSUB   V1, V2                ; Vector subtract V2 from V1
    VMC                          ; Clear vector mask to all 1
    STORE  A, V1                 ; Store vector A
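The effect of the mask can be modeled in Python as follows. This is a sketch of the semantics only: the mask is built in one step for the whole vector (like VMSNE), and the subtraction is then applied only where the mask bit is 1 (like VSUB under mask). The function names are illustrative.

```python
def build_mask_not_equal(v, scalar):
    """Mask bit is 1 where the element differs from the scalar, else 0."""
    return [1 if x != scalar else 0 for x in v]

def masked_subtract(a, b):
    """A[i] := A[i] - B[i] only for the elements of A that are not 0."""
    vm = build_mask_not_equal(a, 0)
    return [x - y if bit else x for x, y, bit in zip(a, b, vm)]
```

For A = [5, 0, 3] and B = [1, 2, 3] the mask is [1, 0, 1] and the result is [4, 0, 0]: the middle element is left untouched because its mask bit is 0, which matches the scalar loop with the if statement.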


More information
