Lecture 8: RISC & Parallel Computers

- RISC vs CISC computers
- Parallel computers
- Final remarks

Zebo Peng, IDA, LiTH

Introduction

Reduced Instruction Set Computer (RISC) is an important innovation in computer architecture. It is an attempt to deliver more computation power by simplifying the instruction set of a computer. The opposing trend is the Complex Instruction Set Computer (CISC), i.e., the conventional computers. Note, however, that most recent processors are neither typical RISC nor typical CISC, but combine the advantages of both approaches. Ex. newer versions of the PowerPC and recent Pentium processors.
Main features of CISC

- The main idea is to make machine instructions similar to high-level language statements. Ex. the CASE (switch) instruction on the VAX computer.
- A large number of instructions (> 200), with complex instructions and data types.
- Many and complex addressing modes (e.g., indirect addressing is often used).
- Microprogramming techniques are used to implement the control unit, due to its complexity.

Arguments for CISC

- A rich instruction set can simplify the compiler by providing instructions that match HLL statements. This works fine if the number of HL languages is small.
- If a program is smaller in size, due to the use of complex machine instructions, it might have better performance:
  - It takes up less memory space and needs fewer instruction fetches, which is important when memory space is limited.
  - Fewer instructions are executed, which may lead to shorter execution time.
- Program execution efficiency can be improved by implementing operations in microcode rather than machine code.
Problems with CISC

- A large instruction set requires complex and time-consuming hardware steps to decode and execute instructions.
- Some complex machine instructions may not match HLL statements, in which case they may be of little use. This becomes a major problem as the number of HL languages grows.
- Memory access is a major issue, due to complex addressing modes and multiple memory-access instructions.
- The irregularity of instruction execution reduces the efficiency of instruction pipelining.
- It also leads to a complex design task, and thus a longer time-to-market and the use of older technology.

Instruction Execution Features (1)

Several studies were conducted to determine the execution characteristics of machine instructions. The studies show:

- Frequency of machine instructions executed:
  - moves of data: 33%
  - conditional branches: 20%
  - arithmetic/logic operations: 16%  (these three classes together: ca. 70%)
  - others: between 0.1% and 10% each
- Addressing modes:
  - the majority of instructions use simple addressing modes.
  - complex addressing modes (memory indirect, indexed + indirect, etc.) are used by only ~18% of the instructions.
Instruction Execution Features (2)

- Operand types:
  - ca. 80% of the operands are scalars (integers, reals, characters, etc.).
  - the rest (ca. 20%) are arrays/structures, and 90% of them are global variables.
  - ca. 80% of the scalars are local variables.
- Conclusion:
  1) The majority of operands are local variables of scalar type, which can be stored in registers.
  2) Implementing many registers in an architecture is good, since we can reduce the number of memory accesses.

Procedure Calls

- Studies have also been done on the percentage of CPU execution time spent executing different HLL statements.
- Most of the time is spent executing CALLs and RETURNs in HLL programs.
  - Even if only 15% of the executed HLL statements are CALLs or RETURNs, they consume most of the execution time, because of their complexity.
- A CALL or RETURN is compiled into a relatively long sequence of machine instructions with many memory references.
  - The contents of registers must be saved to memory and restored from it with each procedure CALL and RETURN.
Evaluation Conclusions

- An overwhelming dominance of simple (ALU and move) operations over complex operations.
- There is also a dominance of simple addressing modes.
- A high frequency of operand accesses; on average each instruction references 1.9 operands.
- Most of the referenced operands are scalars (and can therefore be stored in registers) and are local variables or parameters.
- Optimizing the procedure CALL/RETURN mechanism promises large performance benefits.

RISC computers were developed to take advantage of the above results, and thus deliver better performance!

Main Characteristics of RISC

- A small number of simple instructions (desirably < 100).
  - Simple and small decode and execution hardware is required.
  - A hard-wired controller is needed, rather than microprogramming.
  - The CPU takes less silicon area to implement, and also runs faster.
- Execution of one instruction per clock cycle.
  - The instruction pipeline performs more efficiently due to simple instructions and similar execution patterns.
- Complex operations are executed as a sequence of simple instructions.
  - In the case of CISC, they are executed as a single or a few complex instructions.
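As an illustration of the last point, here is a toy simulation (a hypothetical three-register machine, not any real ISA) of how one CISC-style memory-to-memory add is expanded into a RISC load/add/store sequence:

```python
# Toy machine state: a word-addressable memory and a small register file.
memory = {100: 7, 104: 35, 108: 0}
regs = [0] * 8

# CISC style: one memory-to-memory instruction, e.g. ADD M[108], M[100], M[104].
def cisc_add(dst, src1, src2):
    memory[dst] = memory[src1] + memory[src2]

# RISC (load-store) style: the same operation as four simple instructions.
def risc_add(dst, src1, src2):
    regs[1] = memory[src1]        # LOAD  r1, M[src1]
    regs[2] = memory[src2]        # LOAD  r2, M[src2]
    regs[3] = regs[1] + regs[2]   # ADD   r3, r1, r2
    memory[dst] = regs[3]         # STORE r3, M[dst]

risc_add(108, 100, 104)
print(memory[108])  # → 42
```

Both produce the same result; the RISC version simply makes the memory accesses explicit as separate LOAD/STORE instructions.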
Main Characteristics of RISC (Cont'd)

- Load-and-store architecture:
  - Only LOAD and STORE instructions reference data in memory.
  - All other instructions operate only on registers (register-to-register instructions).
- Only a few simple addressing modes are used. Ex. register, direct, register indirect, and displacement.
- Instructions are of fixed length and uniform format.
  - Loading and decoding of instructions are simple and fast.
  - There is no need to wait until the length of an instruction is known in order to start decoding it.
  - Decoding is simplified because the opcode and address fields are located in the same position for all instructions.

Main Characteristics of RISC (Cont'd)

- A large number of registers can be implemented.
  - Variables and intermediate results can be stored in them in order to reduce memory accesses.
  - All local variables of procedures and the passed parameters can also be stored in registers.
- The large number of registers is possible because the reduced complexity of the processor leaves silicon space on the chip to implement them. This is usually not the case with CISC machines.
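A sketch of why fixed-length, uniform formats make decoding cheap, using the 32-bit MIPS R-format as the example: every field sits at a known bit position, so decoding is just a few shifts and masks, with no need to examine earlier bytes first.

```python
def decode_r_format(word):
    """Decode a 32-bit MIPS R-format instruction word into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31..26
        "rs":    (word >> 21) & 0x1F,  # bits 25..21
        "rt":    (word >> 16) & 0x1F,  # bits 20..16
        "rd":    (word >> 11) & 0x1F,  # bits 15..11
        "shamt": (word >> 6)  & 0x1F,  # bits 10..6
        "funct": word         & 0x3F,  # bits 5..0
    }

# add $8, $9, $10  ->  op=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20
word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 0x20
fields = decode_r_format(word)
print(fields["rd"], fields["funct"])  # → 8 32
```

Contrast this with a CISC machine where the instruction length (1-12 bytes on the i486, for example) is only known after partial decoding.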
Register Windows

- A large number of registers is usually very useful. However, if the contents of all registers must be saved at every procedure call, more registers mean a longer delay.
- A solution to this problem is to divide the register file into a set of fixed-size windows.
  - Each window is assigned to a procedure.
  - Windows for adjacent procedures overlap to allow parameter passing.

Main Advantages of RISC

- The best support is given by optimizing the most used and most time-consuming architectural aspects:
  - frequently executed instructions;
  - simple memory references;
  - procedure call/return; and
  - pipeline design.
- Consequently, we get:
  - high performance for many applications;
  - less design complexity;
  - reduced power consumption; and
  - reduced design cost and time-to-market, which also enables the use of newer technology.
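The overlap idea can be sketched with a toy SPARC-like register file (all sizes here are illustrative assumptions): each window has 8 "in", 8 "local" and 8 "out" registers, and a call advances the current-window pointer by 16 physical registers, so the caller's "out" registers become the callee's "in" registers and parameters are passed without touching memory.

```python
N_PHYS = 128          # physical registers available for windows (illustrative)
phys_regs = [0] * N_PHYS
cwp = 0               # current window pointer

def phys_index(window, r):
    """Map window-relative register r (0-7 in, 8-15 local, 16-23 out)
    to a physical register; adjacent windows overlap by 8 registers."""
    return (window * 16 + r) % N_PHYS

def write_reg(r, value):
    phys_regs[phys_index(cwp, r)] = value

def read_reg(r):
    return phys_regs[phys_index(cwp, r)]

# Caller places a parameter in its first 'out' register (r16) ...
write_reg(16, 99)
# ... a CALL advances the window pointer (no registers saved to memory) ...
cwp += 1
# ... and the callee finds the parameter in its first 'in' register (r0).
print(read_reg(0))  # → 99
```

Memory traffic is only needed when the windows wrap around the physical file (window overflow/underflow), which a real implementation handles with traps.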
Criticism of RISC

- An operation might need two, three, or more instructions to accomplish.
  - More memory access cycles might be needed.
  - Execution speed may be reduced for certain applications.
- It usually leads to longer programs, which need more memory space to store.
  - This is less of a problem now, thanks to the development of memory technologies.
- RISC makes it more difficult to write machine code and assembly programs.
  - Programming becomes more time-consuming.
  - Luckily, very little low-level programming is done nowadays.

Historical CISC/RISC Examples

                          CISC Machines            RISC Machines
Characteristics        IBM 370   VAX    i486       SPARC    MIPS
Year developed         1973      1978   1989       1987     1991
No. of instructions    208       303    235        69       94
Inst. size (bytes)     2-6       2-57   1-12       4        4
Addressing modes       4         22     11         1        1
No. of G.P. registers  16        16     8          40-520   32
Cont. M. size (Kbits)  420       480    246        -        -
A CISC Example: Pentium I

- 32-bit integer processor:
  - Registers: 8 general; 6 for segmented addresses (16 bits); 2 status/control; 1 instruction pointer (PC).
  - Two instruction pipelines:
    - the u-pipe can execute any instruction;
    - the v-pipe executes only simple instructions (its control is hardwired!).
  - On-chip floating-point unit.
  - Pre-fetch buffer for instructions (16 bytes of instructions).
  - Branch target buffer (512-entry branch history table).
  - Microprogrammed control.
- Instruction set:
  - Number of instructions: 235 (+ 57 MMX instructions).
    - e.g., FYL2XP1 x y performs y * log2(x+1) in 313 clock cycles.
  - Instruction size: 1-12 bytes (3.2 on average).
  - Addressing modes: 11.

CISC Pentium I (Cont'd)

- Memory organization:
  - Address length: 32 bits (4 GB address space).
  - Supports virtual memory by paging.
  - On-chip separate 8 KB I-cache and 8 KB D-cache.
    - The D-cache can be configured as write-through or write-back.
  - L2 cache: 256 KB or 512 KB.
- Instruction execution:
  - Execution models: reg-reg, reg-mem, and mem-mem.
  - Five-stage instruction pipeline: Fetch, Decode1, Decode2 (address generation), Execute, and Write-Back.
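A quick sketch of the payoff of such a pipeline (the standard textbook idealization, not a Pentium-specific figure): with k stages, one instruction completes per cycle once the pipeline is full, so n instructions take k + (n - 1) cycles instead of n * k.

```python
def pipelined_cycles(n_instructions, n_stages):
    """Cycles to run n instructions through an ideal k-stage pipeline
    (no stalls, no branch penalties)."""
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, n_stages):
    """Cycles if each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages

n, k = 1000, 5
print(pipelined_cycles(n, k))    # → 1004
print(unpipelined_cycles(n, k))  # → 5000
```

The speedup approaches k for long instruction sequences, which is exactly why the irregular instruction execution of CISC, which introduces stalls, hurts so much.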
A RISC Example: SPARC

Scalable Processor Architecture:
- 32-bit processor:
  - An Integer Unit (IU) executes all instructions.
  - A Floating-Point Unit (FPU) co-processor works concurrently with the IU on floating-point arithmetic operations.
- 69 basic instructions.
- A linear, 32-bit virtual address space (4 GB).
- 40 to 520 general-purpose 32-bit registers:
  - grouped into 2 to 32 overlapping register windows + 8 global registers.
  - Each window has 16 registers.
- Scalability: the latest high-end processors (2016) are Fujitsu's SPARC64 X+, which is designed for 64-bit operations.

Lecture 8: RISC & Parallel Computers

- RISC vs CISC computers
- Parallel computers
- Final remarks
Motivation for Parallel Processing

- Traditional computers cannot meet the high-performance requirements of many applications, such as:
  - simulation of large complex systems in physics, economics, biology, ...
  - distributed databases with search functions.
  - artificial intelligence and autonomous systems.
  - visualization and multimedia.
- Such applications are often characterized by a huge amount of numerical computation and/or a large quantity of input data.
- The only way to deliver the needed performance is to implement many processors in a single computer system.
- Hardware and silicon technology also make it possible to build machines with a huge degree of parallelism cost-effectively.

Flynn's Classification of Computers

Based on the nature of the instruction flow executed by a computer and the data flow on which the instructions operate. The multiplicity of instruction streams and data streams gives us four different classes:
- Single instruction, single data stream (SISD)
- Multiple instruction, single data stream (MISD)
- Single instruction, multiple data stream (SIMD)
- Multiple instruction, multiple data stream (MIMD)
Single Instruction, Multiple Data (SIMD)

- A single machine instruction stream.
- Simultaneous execution on different sets of data.
- A large number of processing elements is usually implemented.
- Lockstep synchronization among the processing elements.
- The processing elements can:
  - have their own private data memories; or
  - share a common memory via an interconnection network.
- Array and vector processors are the most common examples of SIMD machines.

Vector Processor Architecture

[Figure: the instruction fetch and decode unit dispatches scalar instructions to the scalar unit (scalar registers + scalar functional units) and vector instructions to the vector unit (vector registers + vector functional units); both units are connected to the memory system.]
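The lockstep idea can be sketched as follows (a pure-Python illustration, not a real SIMD machine): one instruction is broadcast per step, and every processing element applies it to its own private data.

```python
# Each processing element (PE) has its own private data memory.
pe_data = [[1, 10], [2, 20], [3, 30], [4, 40]]   # 4 PEs, 2 operands each

def broadcast(instruction):
    """Apply the same instruction to every PE's private data in lockstep."""
    return [instruction(data) for data in pe_data]

# One instruction stream, multiple data streams:
sums = broadcast(lambda d: d[0] + d[1])
print(sums)  # → [11, 22, 33, 44]
```

There is only one control flow; parallelism comes entirely from the data being replicated across processing elements.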
Array Processors

- Built with N identical processing elements (PEs) and a number of memory modules.
- All PEs are under the control of a single control unit.
  - They execute instructions in a lock-step manner.
- Processing units and memory elements communicate with each other through an interconnection network.
  - Different topologies can be used, e.g., a mesh.
- The complexity of the control unit is at the same level as that of a uniprocessor system.
- The control unit is usually itself implemented as a computer with its own registers, local memory and ALU (the host computer).

Global Memory Organization

[Figure: the control unit broadcasts the instruction stream (IS) to PE1, PE2, ..., PEn, which access the shared-memory modules M1, M2, ..., Mk and the I/O system through an interconnection network.]
Dedicated Memory Organization

[Figure: the control unit (with its own control memory) broadcasts the instruction stream (IS) to PE1, PE2, ..., PEn; each PE has a dedicated local memory module (M1, M2, ..., Mn), and the PEs are connected to each other and to the I/O system through an interconnection network.]

Performance Measurement Units

- MIPS = Millions of Instructions Per Second.
- FLOPS = FLoating-point Operations Per Second.
  - kiloflops (kFLOPS) = 10^3
  - megaflops (MFLOPS) = 10^6
  - gigaflops (GFLOPS) = 10^9
  - teraflops (TFLOPS) = 10^12
  - petaflops (PFLOPS) = 10^15
  - exaflops (EFLOPS) = 10^18
  - zettaflops (ZFLOPS) = 10^21
  - yottaflops (YFLOPS) = 10^24
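A small helper (an illustrative sketch; the function name is my own) that expresses a raw floating-point-operations-per-second figure in the largest fitting unit from the table above:

```python
PREFIXES = [("YFLOPS", 10**24), ("ZFLOPS", 10**21), ("EFLOPS", 10**18),
            ("PFLOPS", 10**15), ("TFLOPS", 10**12), ("GFLOPS", 10**9),
            ("MFLOPS", 10**6), ("kFLOPS", 10**3)]

def format_flops(ops_per_second):
    """Express a raw ops/s figure using the largest unit it reaches."""
    for name, scale in PREFIXES:
        if ops_per_second >= scale:
            return f"{ops_per_second / scale:g} {name}"
    return f"{ops_per_second:g} FLOPS"

print(format_flops(2.5e12))  # → 2.5 TFLOPS
```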
Multiple Instruction, Multiple Data (MIMD)

- It consists of a set of processors which:
  - simultaneously execute different instruction sequences; and
  - operate on different sets of data.
- The MIMD class can be further divided:
  - Shared memory (tightly coupled):
    - Symmetric multiprocessor (SMP)
    - Non-uniform memory access (NUMA)
  - Distributed memory (loosely coupled) = clusters

MIMD with Shared Memory

[Figure: control units CU1, ..., CUn feed instruction streams (IS) to processing elements PE1, ..., PEn, whose data streams (DS) all go to a single shared memory.]

- Tightly coupled, not very scalable.
- Typically called multi-processor systems.
MIMD with Distributed Memory

[Figure: control units CU1, ..., CUn feed instruction streams (IS) to processing elements PE1, ..., PEn; each PE has its own local memory (LM1, ..., LMn), and the nodes communicate through an interconnection network.]

- Loosely coupled with message passing, scalable.
- Typically called multi-computer systems.

Symmetric Multiprocessors (Ex.)

- A set of processors of similar capacity.
- They share the same memory.
  - A single memory or a set of memory modules.
  - The memory access time is the same for each processor.
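Communication in such loosely coupled systems can be sketched with explicit message passing (a minimal single-process illustration using Python queues as the "interconnection network"; real clusters would use e.g. MPI or sockets):

```python
from queue import Queue

# One message queue per node plays the role of the interconnection network.
inbox = {"node_a": Queue(), "node_b": Queue()}

def send(dest, message):
    inbox[dest].put(message)

def receive(node):
    return inbox[node].get()

# Node A has data only in its local memory; node B needs it, so the value
# must be sent as a message -- there is no shared memory space.
local_mem_a = {"x": 21}
send("node_b", local_mem_a["x"])

value = receive("node_b")
local_mem_b = {"y": value * 2}
print(local_mem_b["y"])  # → 42
```

The key contrast with the shared-memory case: data movement is explicit in the program, instead of being implicit in loads and stores.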
Non-Uniform Memory Access (NUMA)

- Each node is, in general, an SMP.
- Each node has its own main memory.
- Each processor has its own L1 and L2 caches.

NUMA (Cont'd)

- Nodes are connected by some networking facility.
- Each processor sees a single addressable memory space:
  - each memory location has a unique system-wide address.
NUMA (Cont'd)

- On a Load X, the memory request order is:
  1. L1 cache (local to the processor)
  2. L2 cache (local to the processor)
  3. Main memory (local to the node)
  4. Remote memory (via the interconnection network)
- Different processors therefore access different regions of memory at different speeds!

Loosely Coupled MIMD: Clusters

- Cluster: a set of computers connected, typically over a high-bandwidth local area network, and used as a multi-computer system.
  - A group of interconnected stand-alone computers.
  - They work together as a unified resource.
- Each computer is called a node.
  - A node can also be a multiprocessor itself, such as an SMP.
- Message passing is used for communication between nodes.
  - There is no longer a shared memory space.
- Ex. Google uses several clusters of off-the-shelf PCs to provide a cheap solution for its services. More than 15,000 nodes form a cluster.
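The NUMA memory request order described above can be sketched as a simple lookup chain (the latencies and addresses are made-up illustrative numbers, not measurements):

```python
# Each level: (name, cycle latency, set of addresses it currently holds).
# Latencies are purely illustrative.
hierarchy = [
    ("L1 cache",      2,   {0x10}),
    ("L2 cache",      10,  {0x10, 0x20}),
    ("local memory",  100, {0x10, 0x20, 0x30}),
    ("remote memory", 400, {0x10, 0x20, 0x30, 0x40}),
]

def load(address):
    """Return (level that served the request, its latency):
    levels are probed in order, nearest first."""
    for name, latency, contents in hierarchy:
        if address in contents:
            return name, latency
    raise ValueError("address not mapped")

print(load(0x20))  # → ('L2 cache', 10)
print(load(0x40))  # → ('remote memory', 400)
```

This makes the "non-uniform" part concrete: the same Load instruction can cost a few cycles or hundreds, depending on where the data happens to live.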
Lecture 8: RISC & Parallel Computers

- RISC vs CISC computers
- Parallel computers
- Final remarks

The Carry-Away Messages

- Computers are data processing machines built on the von Neumann architecture principle: CPU and memory.
- Memory access is the bottleneck, addressed by:
  - cache memory
  - memory hierarchy
- Growing instruction complexity leads to a complicated controller, addressed by:
  - microprogramming
  - RISC architecture
- Sequential execution of instructions imposes a speed limitation, addressed by:
  - instruction execution pipelining
  - superscalar execution
- Parallel processing improves the performance of computer systems dramatically: the main trend of computers.
Written Examination

- Saturday(!), Jan 13, 14-18.
- Don't forget to register for the exam.
- Closed book. But an English/Swedish dictionary is permitted (NOTE: no electronic dictionary).
- Answers can be written in English and/or Swedish.
- Covers only the topics discussed in the lectures.
  - Follow the lecture notes when preparing for the exam.
- Two previous exams are given at the course website.
- Reading instructions are also at the course website.

Questions & Feedback

- Dedicated time for answering exam-related questions: Thursday, January 11, 8:15-12:00, at Zebo's office, Building B.
- Please provide feedback on the course:
  - by email; and
  - by filling in the course evaluation, please!