
Lecture 8: RISC & Parallel Computers
Zebo Peng, IDA, LiTH

Outline:
- RISC vs CISC computers
- Parallel computers
- Final remarks

Introduction
- Reduced Instruction Set Computer (RISC) is an important innovation in computer architecture. It is an attempt to deliver more computing power by simplifying the instruction set of a computer.
- The opposite trend is the Complex Instruction Set Computer (CISC), i.e., the conventional computers.
- Note, however, that most recent processors are neither typical RISC nor typical CISC, but combine the advantages of both approaches, e.g., newer versions of PowerPC and recent Pentium processors.

Main Features of CISC
- The main idea is to make machine instructions similar to high-level language (HLL) statements, e.g., the CASE (switch) instruction on the VAX computer.
- A large number of instructions (> 200), with complex instructions and data types.
- Many and complex addressing modes (e.g., indirect addressing is often used).
- Microprogramming techniques are used to implement the control unit, due to its complexity.

Arguments for CISC
- A rich instruction set can simplify the compiler by providing instructions that match HLL statements (see the sketch below). This works well if the number of HL languages is small.
- If a program is smaller in size, due to the use of complex machine instructions, it may have better performance:
  - it takes up less memory space and needs fewer instruction fetches, which is important when memory space is limited;
  - fewer instructions are executed, which may lead to shorter execution time.
- Program execution efficiency can be improved by implementing operations in microcode rather than machine code.
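To make the HLL-matching idea concrete, consider a C switch statement. On a CISC machine such as the VAX, the whole multiway branch can map onto a single CASE instruction, whereas a RISC compiler expands it into a sequence of simple compares and branches. The program below is an illustration of mine, not from the slides; the comments about generated code are schematic, not actual compiler output.

```c
#include <stdio.h>

/* A multiway branch in C. A VAX-style CISC compiler could emit one
 * CASE instruction for the whole switch; a RISC compiler emits a
 * sequence of simple compare-and-branch instructions (or a jump
 * table). The comments are schematic, not real compiler output. */
int classify(int op)
{
    switch (op) {            /* CISC: a single CASE instruction */
    case 0:  return 10;      /* RISC: cmp op,0; beq L0; ...     */
    case 1:  return 20;
    case 2:  return 30;
    default: return -1;
    }
}

int main(void)
{
    printf("%d\n", classify(1));   /* prints 20 */
    return 0;
}
```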

Problems with CISC
- A large instruction set requires complex and time-consuming hardware steps to decode and execute instructions.
- Some complex machine instructions may not match HLL statements, in which case they are of little use. This becomes a major problem as the number of HL languages grows.
- Memory access is a major issue, due to complex addressing modes and multiple memory-access instructions.
- The irregularity of instruction execution reduces the efficiency of instruction pipelining.
- It also leads to a more complex design task, thus a longer time-to-market, and the use of older technology.

Instruction Execution Features (1)
Several studies have been conducted to determine the execution characteristics of machine instructions. They show:
- Frequency of machine instructions executed:
  - moves of data: 33%
  - conditional branches: 20%
  - arithmetic/logic operations: 16%
  (together ca. 70%)
  - others: between 0.1% and 10% each
- Addressing modes:
  - the majority of instructions use simple addressing modes;
  - complex addressing modes (memory indirect, indexed + indirect, etc.) are used by only ~18% of the instructions.

Instruction Execution Features (2)
- Operand types:
  - ca. 80% of the operands are scalars (integers, reals, characters, etc.);
  - the rest (ca. 20%) are arrays/structures, and 90% of those are global variables;
  - ca. 80% of the scalars are local variables.
- Conclusions:
  1) The majority of operands are local variables of scalar type, which can be stored in registers.
  2) Implementing many registers in an architecture is beneficial, since it reduces the number of memory accesses.

Procedure Calls
- Studies have also been done on the percentage of CPU execution time spent executing different HLL statements: most of the time is spent executing CALLs and RETURNs in HLL programs.
- Even though only about 15% of the executed HLL statements are a CALL or RETURN, they consume most of the execution time, because of their complexity.
- A CALL or RETURN is compiled into a relatively long sequence of machine instructions with many memory references: the contents of registers must be saved to memory and restored from it with each procedure CALL and RETURN.

Evaluation Conclusions
- An overwhelming dominance of simple (ALU and move) operations over complex operations; there is also a dominance of simple addressing modes.
- A high frequency of operand accesses: on average, each instruction references 1.9 operands.
- Most of the referenced operands are scalars (and can therefore be stored in registers) and are local variables or parameters.
- Optimizing the procedure CALL/RETURN mechanism promises large performance benefits.
- RISC computers were developed to take advantage of these results, and thus to deliver better performance!

Main Characteristics of RISC
- A small number of simple instructions (desirably < 100):
  - only simple and small decode and execution hardware is required;
  - a hard-wired controller is used, rather than microprogramming;
  - the CPU takes less silicon area to implement, and also runs faster.
- Execution of one instruction per clock cycle:
  - the instruction pipeline performs more efficiently, due to simple instructions and similar execution patterns.
- Complex operations are executed as a sequence of simple instructions; in the case of CISC, they are executed as one single or a few complex instructions.

Main Characteristics of RISC (Cont'd)
- Load-and-store architecture (see the sketch below):
  - only LOAD and STORE instructions reference data in memory;
  - all other instructions operate only on registers (register-to-register instructions);
  - only a few simple addressing modes are used, e.g., register, direct, register indirect, and displacement.
- Instructions are of fixed length and uniform format:
  - loading and decoding of instructions is simple and fast; there is no need to wait until the length of an instruction is known in order to start decoding it;
  - decoding is simplified because the opcode and address fields are located in the same position in all instructions.

Main Characteristics of RISC (Cont'd)
- A large number of registers can be implemented:
  - variables and intermediate results can be stored in them in order to reduce memory accesses;
  - all local variables of procedures and the passed parameters can also be stored in registers.
- The large number of registers is possible because the reduced complexity of the processor leaves silicon space on the chip to implement them. This is usually not the case with CISC machines.
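The following minimal sketch, an illustration of mine rather than any real ISA, interprets a hypothetical four-instruction load-store machine executing a = b + c: only LOAD and STORE touch memory, ADD works register-to-register, and every instruction has the same fixed format.

```c
#include <stdio.h>

/* A hypothetical load-store machine: only LOAD and STORE access
 * memory; ADD is register-to-register. Opcodes, fields and
 * encodings are invented for illustration. */
enum op { LOAD, STORE, ADD, HALT };

struct instr { enum op op; int rd, rs1, rs2, addr; };

int main(void)
{
    int mem[16] = {5, 7};        /* mem[0] = b, mem[1] = c       */
    int reg[8]  = {0};

    /* a = b + c, compiled for the load-store style */
    struct instr prog[] = {
        { LOAD,  1, 0, 0, 0 },   /* r1 <- mem[0]                 */
        { LOAD,  2, 0, 0, 1 },   /* r2 <- mem[1]                 */
        { ADD,   3, 1, 2, 0 },   /* r3 <- r1 + r2                */
        { STORE, 3, 0, 0, 2 },   /* mem[2] <- r3                 */
        { HALT,  0, 0, 0, 0 }
    };

    for (int pc = 0; prog[pc].op != HALT; pc++) {
        struct instr i = prog[pc];
        switch (i.op) {
        case LOAD:  reg[i.rd] = mem[i.addr];              break;
        case STORE: mem[i.addr] = reg[i.rd];              break;
        case ADD:   reg[i.rd] = reg[i.rs1] + reg[i.rs2];  break;
        default:                                          break;
        }
    }
    printf("a = %d\n", mem[2]);   /* prints a = 12 */
    return 0;
}
```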

Register Windows
- A large number of registers is usually very useful. However, if the contents of all registers must be saved at every procedure call, more registers mean a longer delay.
- A solution to this problem is to divide the register file into a set of fixed-size windows:
  - each window is assigned to a procedure;
  - windows of adjacent procedures overlap to allow parameter passing (illustrated in the sketch below).

Main Advantages of RISC
- The best support is given to the most used and most time-consuming architectural aspects:
  - frequently executed instructions;
  - simple memory references;
  - procedure call/return; and
  - pipeline design.
- Consequently, we get:
  - high performance for many applications;
  - less design complexity;
  - reduced power consumption; and
  - reduced design cost and time-to-market, allowing also the use of newer technology.
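The sketch below models overlapping register windows in C. The window sizes (8 in, 8 local and 8 out registers, with a call sliding the window by 16) are assumptions of mine chosen to resemble SPARC; the point is that the caller's out registers become the callee's in registers, so parameters pass without any memory traffic.

```c
#include <stdio.h>

#define NWINDOWS 4
#define WINSLIDE 16     /* locals + outs: how far a call slides   */

/* One flat register file; each procedure sees a 24-register
 * window (8 in, 8 local, 8 out) starting at cwp. The sizes are
 * invented for illustration. */
static int regfile[NWINDOWS * WINSLIDE + 16];
static int cwp = 0;     /* current window pointer                 */

static int *in_(int i)    { return &regfile[cwp + i]; }
static int *local_(int i) { return &regfile[cwp + 8 + i]; }
static int *out_(int i)   { return &regfile[cwp + 16 + i]; }

/* A call slides the window: the caller's outs overlap the
 * callee's ins, so nothing is saved to memory. */
static void call(void)  { cwp += WINSLIDE; }
static void ret(void)   { cwp -= WINSLIDE; }

static void callee(void)
{
    *local_(0) = *in_(0) + *in_(1);  /* parameters arrived in ins */
    *in_(0) = *local_(0);            /* return value to caller    */
}

int main(void)
{
    *out_(0) = 5;                    /* pass parameters in outs   */
    *out_(1) = 7;
    call();
    callee();
    ret();
    printf("result = %d\n", *out_(0));  /* prints result = 12     */
    return 0;
}
```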

Criticism of RISC
- An operation might need two, three, or more instructions to accomplish:
  - more memory access cycles might be needed;
  - execution speed may be reduced for certain applications.
- It usually leads to longer programs, which need a larger memory space to store. This is less of a problem now, with the development of memory technologies.
- RISC makes it more difficult to write machine code and assembly programs, so programming becomes more time-consuming. Luckily, very little low-level programming is done nowadays.

Historical CISC/RISC Examples

                                 CISC machines          RISC machines
  Characteristic              IBM370   VAX    i486     SPARC     MIPS
  Year developed                1973   1978   1989      1987     1991
  No. of instructions            208    303    235        69       94
  Instruction size (bytes)       2-6   2-57   1-12         4        4
  Addressing modes                 4     22     11         1        1
  No. of G.P. registers           16     16      8    40-520       32
  Control memory size (Kbits)    420    480    246         -        -

A CISC Example: Pentium I
- A 32-bit integer processor:
  - registers: 8 general-purpose, 6 for segmented addresses (16 bits), 2 status/control, and 1 instruction pointer (PC);
  - two instruction pipelines: the u-pipe can execute any instruction, while the v-pipe executes only simple instructions (its control is hardwired!);
  - on-chip floating-point unit;
  - pre-fetch buffer for instructions (16 bytes of instructions);
  - branch target buffer (512-entry branch history table);
  - microprogrammed control.
- Instruction set:
  - number of instructions: 235 (+ 57 MMX instructions); e.g., FYL2XP1 computes y * log2(x + 1) in 313 clock cycles;
  - instruction size: 1-12 bytes (3.2 on average);
  - addressing modes: 11.

CISC Pentium I (Cont'd)
- Memory organization:
  - address length: 32 bits (4 GB address space);
  - virtual memory supported by paging;
  - on-chip separate 8 KB I-cache and 8 KB D-cache; the D-cache can be configured as write-through or write-back;
  - L2 cache: 256 KB or 512 KB.
- Instruction execution:
  - execution models: reg-reg, reg-mem, and mem-mem;
  - five-stage instruction pipeline: Fetch, Decode1, Decode2 (address generation), Execute, and Write-Back.

A RISC Example: SPARC
- Scalable Processor ARChitecture, a 32-bit processor:
  - an Integer Unit (IU) executes all instructions;
  - a Floating-Point Unit (FPU) co-processor works concurrently with the IU on floating-point arithmetic;
  - 69 basic instructions;
  - a linear, 32-bit virtual address space (4 GB);
  - 40 to 520 general-purpose 32-bit registers, grouped into 2 to 32 overlapping register windows of 16 registers each, plus 8 global registers. Scalability!
- The latest high-end processors (2016) are Fujitsu's SPARC64 X+, designed for 64-bit operations.

Lecture 8: RISC & Parallel Computers
- RISC vs CISC computers
- Parallel computers
- Final remarks

Motivation for Parallel Processing
- Traditional computers cannot meet the high performance requirements of many applications, such as:
  - simulation of large, complex systems in physics, economics, biology, ...;
  - distributed databases with search functions;
  - artificial intelligence and autonomous systems;
  - visualization and multimedia.
- Such applications are often characterized by a huge amount of numerical computation and/or a high volume of input data.
- The only way to deliver the needed performance is to implement many processors in a single computer system.
- Hardware and silicon technology also make it possible to build machines with a huge degree of parallelism cost-effectively.

Flynn's Classification of Computers
- Based on the nature of the instruction flow executed by a computer and the data flow on which the instructions operate.
- The multiplicity of instruction streams and data streams gives four classes:
  - single instruction, single data stream (SISD);
  - multiple instruction, single data stream (MISD);
  - single instruction, multiple data stream (SIMD);
  - multiple instruction, multiple data stream (MIMD).

Single Instruction, Multiple Data (SIMD)
- A single machine instruction stream, executed simultaneously on different sets of data.
- A large number of processing elements is usually implemented, with lockstep synchronization among them.
- The processing elements can:
  - have their respective private data memories; or
  - share a common memory via an interconnection network.
- Array and vector processors are the most common examples of SIMD machines.

Vector Processor Architecture
[Figure: an instruction fetch and decode unit dispatches scalar instructions to a scalar unit (scalar registers plus scalar functional units) and vector instructions to a vector unit (vector registers plus vector functional units); both units are connected to the memory system.]
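As a concrete illustration of mine (not from the slides), the loop below is the classic SIMD-friendly pattern: the same operation is applied element-wise to long arrays, so a vector unit, an array processor, or a vectorizing compiler can execute one instruction over many data elements at once.

```c
#include <stdio.h>

#define N 1024

/* The same ADD is applied to every element pair, so a single
 * vector instruction (or one lockstep step of an array processor)
 * can cover many iterations of this loop at once. */
void vector_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    vector_add(a, b, c, N);
    printf("c[10] = %.1f\n", c[10]);   /* prints c[10] = 30.0 */
    return 0;
}
```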

Array Processors
- Built with N identical processing elements (PEs) and a number of memory modules.
- All PEs are under the control of a single control unit; they execute instructions in a lockstep manner.
- Processing units and memory elements communicate with each other through an interconnection network; different topologies can be used, e.g., a mesh.
- The complexity of the control unit is at the same level as in a uniprocessor system. The control unit is usually itself implemented as a computer, with its own registers, local memory, and ALU (the host computer).

Global Memory Organization
[Figure: a control unit broadcasts the instruction stream (IS) to processing elements PE1, PE2, ..., PEn, which access the shared memory modules M1, M2, ..., Mk and the I/O system through an interconnection network.]

Dedicated Memory Organization
[Figure: a control unit with its own memory (Mcont) broadcasts the instruction stream (IS) to processing elements PE1, PE2, ..., PEn, each paired with a private memory module M1, M2, ..., Mn; the PE/memory pairs communicate with each other and with the I/O system through an interconnection network.]

Performance Measurement Units
- MIPS = Millions of Instructions Per Second.
- FLOPS = FLoating-point Operations Per Second:
  - kiloflops (kflops) = 10^3
  - megaflops (MFLOPS) = 10^6
  - gigaflops (GFLOPS) = 10^9
  - teraflops (TFLOPS) = 10^12
  - petaflops (PFLOPS) = 10^15
  - exaflops (EFLOPS) = 10^18
  - zettaflops (ZFLOPS) = 10^21
  - yottaflops (YFLOPS) = 10^24
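A rough way to see these units in practice (my own illustration, not from the slides) is to time a known number of floating-point operations. The simple benchmark below estimates the MFLOPS achieved by a scalar multiply-add loop; clock() is coarse, so treat the result as an order-of-magnitude estimate.

```c
#include <stdio.h>
#include <time.h>

/* Estimate MFLOPS by timing 2*n floating-point operations
 * (one multiply and one add per iteration). */
int main(void)
{
    const long n = 100000000L;        /* 10^8 iterations          */
    double x = 0.0;

    clock_t start = clock();
    for (long i = 0; i < n; i++)
        x = x * 1.0000001 + 0.5;      /* 2 FLOPs per iteration    */
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    double mflops  = 2.0 * (double)n / seconds / 1e6;

    /* Printing x keeps the compiler from deleting the loop. */
    printf("x = %g, estimated %.1f MFLOPS\n", x, mflops);
    return 0;
}
```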

Multiple Instruction, Multiple Data (MIMD)
- Consists of a set of processors that simultaneously execute different instruction sequences on different sets of data.
- The MIMD class can be further divided into:
  - shared memory (tightly coupled):
    - symmetric multiprocessors (SMP);
    - non-uniform memory access (NUMA);
  - distributed memory (loosely coupled) = clusters.

MIMD with Shared Memory
[Figure: control units CU1, CU2, ..., CUn each feed an instruction stream (IS) to processing elements PE1, PE2, ..., PEn, whose data streams (DS) all go to a single shared memory.]
- Tightly coupled, not very scalable.
- Typically called multiprocessor systems.
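On a shared-memory MIMD machine, the natural programming model is multiple threads operating on one address space. The sketch below, an illustration of mine using POSIX threads, lets two threads sum the two halves of a shared array; the partial sums live in shared memory and are combined at the end. Compile with, e.g., `cc -pthread`.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000

/* Shared address space: both threads see the same arrays. */
static double data[N];
static double partial[2];

static void *sum_half(void *arg)
{
    int id = *(int *)arg;
    long lo = (long)id * (N / 2), hi = lo + N / 2;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;          /* each thread writes its own slot */
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[2];
    int id[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, sum_half, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("sum = %.0f\n", partial[0] + partial[1]);  /* 1000000 */
    return 0;
}
```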

MIMD with Distributed Memory
[Figure: control units CU1, CU2, ..., CUn each feed an instruction stream (IS) to processing elements PE1, PE2, ..., PEn, each with its own local memory LM1, LM2, ..., LMn; the nodes communicate through an interconnection network.]
- Loosely coupled, with message passing; scalable.
- Typically called multi-computer systems.

Symmetric Multiprocessors (Ex)
- A set of processors of similar capacity.
- They share the same memory: a single memory, or a set of memory modules.
- Memory access time is the same for each processor.

Non-Uniform Memory Access (NUMA)
- Each node is, in general, an SMP.
- Each node has its own main memory.
- Each processor has its own L1 and L2 caches.

NUMA (Cont'd)
- Nodes are connected by some networking facility.
- Each processor sees a single addressable memory space: each memory location has a unique system-wide address.

NUMA (Cont'd)
- Memory request order for a Load X:
  1) L1 cache (local to the processor);
  2) L2 cache (local to the processor);
  3) main memory (local to the node);
  4) remote memory (via the interconnection network).
- Different processors therefore access different regions of memory at different speeds!

Loosely Coupled MIMD: Clusters
- Cluster: a set of computers, typically connected over a high-bandwidth local area network and used as a multi-computer system:
  - a group of interconnected, stand-alone computers working together as a unified resource;
  - each computer is called a node; a node can itself be a multiprocessor, such as an SMP.
- Message passing is used for communication between the nodes; there is no longer a shared memory space.
- Ex. Google uses several clusters of off-the-shelf PCs to provide a cheap solution for its services; more than 15,000 nodes form a cluster.
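Since cluster nodes share no memory, data moves between them in explicit messages. Below is a minimal sketch of mine using MPI, the standard message-passing library (the slides do not prescribe any particular one): each process computes a partial sum over its own slice of the work, and MPI_Reduce combines the results with messages. Run it with, e.g., `mpirun -np 4 ./sum`.

```c
#include <mpi.h>
#include <stdio.h>

/* Each process (typically one per cluster node) computes a partial
 * sum over its own slice; MPI_Reduce combines the partial sums by
 * message passing, since there is no shared memory between nodes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 1000000L;
    long lo = rank * n / size, hi = (rank + 1) * n / size;
    double local = 0.0, total = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0;

    /* Gather all partial sums onto rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %.0f\n", total);  /* prints total = 1000000 */

    MPI_Finalize();
    return 0;
}
```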

Lecture 8: RISC & Parallel Computers
- RISC vs CISC computers
- Parallel computers
- Final remarks

The Carry-Away Messages
- Computers are data-processing machines built on the von Neumann architecture principle: a CPU and a memory.
- Memory access is the bottleneck:
  - cache memory;
  - memory hierarchy.
- Growing instruction complexity leads to a complicated controller:
  - microprogramming;
  - RISC architecture.
- Sequential execution of instructions limits speed:
  - instruction execution pipelining;
  - superscalar execution.
- Parallel processing improves the performance of computer systems dramatically; it is the main trend of computers.

Written Examination
- Saturday(!), Jan 13, 14-18. Don't forget to register for the exam.
- Closed book, but an English/Swedish dictionary is permitted (NOTE: no electronic dictionary).
- Answers can be written in English and/or Swedish.
- Covers only the topics discussed in the lectures; follow the lecture notes when preparing for the exam.
- Two previous exams and the reading instructions are available at the course website.

Questions & Feedback
- Dedicated time for answering exam-related questions: Thursday, January 11, 8:15-12:00, at Zebo's office, Building B.
- Please provide feedback on the course: by email, and by filling in the course evaluation, please!