
1 ECE 172 Digital Systems, Chapter 4.2: Architecture. Herbert G. Mayer, PSU. Status 6/10/2018

2 Syllabus
- Introduction
- Uniprocessor
- Multiprocessor
- Instruction Set Architecture
- Iron Law
- Amdahl's Law
- VLIW
- Systolic Array
- Bibliography

3 Introduction
- In Digital Systems we focus on digital HW architecture: modules that enable fast operations, primarily computations
- Architecture includes registers, memory, caches, processor, bus, peripherals, etc.
- Ideal outcome for you: understand, and learn to design, a complete digital computer system
- Complete means: fully functional, fast, cheap to build, consuming little power, requiring a small volume, in line with actual priorities
- Priorities include: function, schedule, cost, number of developers, evolving technologies, environment, etc.

4 Introduction (copy of p. 4, section 4.1)
- Key modules of any computer architecture:
  1. Central Processing Unit (CPU): includes the ALU for integers and other numeric types, the register file, pc, ir, flags, and internal registers that are not API-visible
  2. Memory (AKA main memory), including stack and heap
  3. Caches: L1 and sometimes L2, integrated on the same silicon die; physically, but not logically, part of the CPU
  4. Data, address, and control buses connecting CPU, peripherals, and memory; AKA the system bus
  5. Peripherals, connected via bus
  6. IO devices and controllers, connected to the system bus
  7. Branch prediction unit; invisible to the API
- Vast speed differences between CPU and memory: memory may be a few times to two decimal orders of magnitude slower than the CPU
- Speed disparity between CPU, memory, and peripherals!

5 Introduction: Uniprocessors
- Single Accumulator Architecture (earliest systems, 1940s), e.g. John von Neumann's computer, or the earlier John Vincent Atanasoff computer
  - Were the basis for ENIAC
  - Commercial computers actually built and sold
- General-Purpose Register (GPR) Architectures
  - 2-Address Architecture (GPR with one operand implied), e.g. IBM 360
  - 3-Address Architecture (GPR with all operands of an arithmetic operation explicit), e.g. VAX 11/70
- Stack Machines (e.g. B5000 see [2], B6000, HP3000 see [3])

6 Introduction: Multiprocessors
- Vector Architecture, e.g. Amdahl 470/6, competing with IBM's 360 in the 1970s; blurs the differentiation with multiprocessors
  - Yet vector architecture is still a pure uniprocessor architecture
- Shared Memory Architecture
- Distributed Memory Architecture
- Systolic Array Architecture; see Intel iWarp and CMU's Warp architecture
- Data Flow Machine; see Jack Dennis' work at MIT
- BSP, the Burroughs Scientific Processor of the 1970s

7 Introduction: Hybrid Processors
- Superscalar Architecture; see Intel 80860, AKA i860
- VLIW Architecture; see the Multiflow computer
- Pipelined Architecture; debatable whether UP or hybrid; we postulate: UP
- EPIC Architecture; see the Intel Itanium architecture
- Multi-core processors as crafted today by AMD, HP, IBM, and Intel Corp.

8 Common Architecture Attributes
- Main memory (main store); separate from the CPU
- Program instructions stored in main memory
- Data also stored in main memory; known as the von Neumann architecture
- Data available in, and distributed over, main memory, stack, heap, reserved OS space, free space, and IO space
- Instruction pointer ip (AKA instruction counter ic, program counter pc), plus other special registers
- Von Neumann memory bottleneck: everything travels on the same bus

9 Common Architecture Attributes
- Accumulator (1 register, or many) holds the result of an arithmetic/logical operation
- The memory controller handles memory access requests from the processor to memory; AKA chipset
- The current trend is to move all or part of the memory controller onto the CPU chip; this does not mean the controller IS part of the CPU!
- Processor units include: FP units, integer unit, control unit, register file, pathways

10 Data-Stream, Instruction-Stream
- Data-stream/instruction-stream classification, defined by Michael J. Flynn in 1966!
- Single-Instruction, Single-Data stream (SISD) architecture, e.g. the PDP-11
- Single-Instruction, Multiple-Data stream (SIMD) architecture, e.g. array processors: Solomon, Illiac IV, BSP, TMC
- Multiple-Instruction, Single-Data stream (MISD) architecture, e.g. possibly: superscalar, pipelined, VLIW, and EPIC machines
- Multiple-Instruction, Multiple-Data stream (MIMD) architecture; perhaps a true multiprocessor is yet to be built; yes, debatable! (Ignoring marketing hype)

11 Generic Computer Architecture Model (figure)

12 Instruction Set Architecture (ISA)
- The ISA is the boundary between software (SW) and hardware (HW)
- Specifies the logical machine that is visible to the programmer and compiler
- Is the functional specification for processor designers
- The boundary between CPU hardware and system firmware is sometimes a very low-level piece of system software that handles exceptions, interrupts, and HW-specific services
- That level could fall into the domain of the OS

13 Instruction Set Architecture (ISA)
- Specified by the ISA:
  - Operations: what to perform, and in which order
  - Temporary operand storage in the CPU: registers, accumulator, stack (cache, being a duplicate of memory portions)
  - Note that the stack can be word-sized, even bit-sized (design of the successor for NCR's Century architecture of the 1970s)
  - Number of operands per instruction
  - Operand location: where and how to specify/locate the operands
  - Type and size of operands
  - Instruction encoding in binary

14 Instruction Set Architecture (ISA): the ISA as Dynamic-Static Interface (DSI) (figure)

15 Iron Law of Processor Performance
- Clock rate doesn't count, bus width doesn't count, the number of registers and operations executed in parallel doesn't count!
- What counts is: how long it takes for the computational task to complete. That time is of the essence of computing!
- If a MIPS-based solution runs at 1 GHz and completes a program X in 2 minutes, while an Intel Pentium 4 based solution runs at 3 GHz and completes that same program X in 2.5 minutes, programmers and users are more interested in the former solution

16 Iron Law of Processor Performance
- If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture needs size 1.1 Y bytes, the Intel solution is generally more attractive
  - Assuming the same execution performance
- Meaning of this:
  - Wall-clock time (Time) is the time I have to wait for completion
  - Program size, perhaps measured in bytes of code, bytes of static data space, or size of stack and heap used, is an indicator of the overall complexity of the computational task and the physical parameters of the data

17 Iron Law of Processor Performance (figure; see the identity below)
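The figure on this slide is not reproduced in the transcription; presumably it shows the classic iron-law identity that the surrounding slides paraphrase (a reconstruction, not the original slide):

    Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)

Wall-clock time is the product of instruction count, CPI, and cycle time; a higher clock rate (the last factor) pays off only if it does not inflate the first two.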

18 Amdahl's Law
- Articulated by Gene Amdahl during the 1967 AFIPS conference
- States that the maximum speedup of a program P is dominated by its sequential portion S
- I.e. if some part of program P can be perfectly accelerated by arbitrarily many parallel processors, but some part S of P is inherently sequential, then the resulting performance is dominated by S
- See the Wikipedia sample on the next page!

19 Amdahl's Law (Source: Wikipedia)
- The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20x, as shown in the diagram, regardless of the number of available processors
- n = number of processors, n a natural number
- B = strictly sequential fraction of the program, 0 <= B <= 1
- T(n) = time to execute with n processors: T(n) = T(1) * ( B + (1 - B) / n )
- S(n) = speedup = T(1) / T(n) = 1 / ( B + (1 - B) / n )
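As a check on the 20x limit above, here is a minimal C sketch of the speedup formula; the function name and the chosen processor counts are illustrative, not from the slides:

    #include <stdio.h>

    /* Amdahl's law: S(n) = 1 / (B + (1 - B) / n), B = sequential fraction. */
    double speedup(double B, double n) {
        return 1.0 / (B + (1.0 - B) / n);
    }

    int main(void) {
        /* Slide example: 95% parallelizable, i.e. B = 0.05. */
        printf("S(16)   = %.2f\n", speedup(0.05, 16.0));    /* ~ 9.14 */
        printf("S(4096) = %.2f\n", speedup(0.05, 4096.0));  /* ~19.91 */
        printf("S(inf) -> %.2f\n", 1.0 / 0.05);             /* limit: 20 */
        return 0;
    }

No matter how large n grows, S(n) never exceeds 1/B, which is the point of the slide.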

20 Amdahl's Law (Source: Wikipedia) (figure: speedup curves)

21 Uniprocessor (UP) Architectures
- Ancient! Not used today for general computing:
- Single Accumulator Architecture (SAA), e.g. von Neumann's machine, in the 1940s
  - Single register to hold operation results, conventionally called the accumulator
  - Accumulator used as the destination of arithmetic operations, and as (one) source
  - Has a central processing unit, a memory unit, and a connecting memory bus
  - pc points to the next instruction (in memory) to be executed
  - Commercial sample: ENIAC

22 Uniprocessor (UP) Architectures (figure: accumulator, main memory, pc)

23 General-Purpose Register (GPR) Architecture
- Accumulates ALU results in n registers; n was typically 4, 8, 16, or 64
- Allows register-to-register operations: fast!
- GPR is essentially a multi-register extension of the SA architecture
- A two-address architecture specifies one source operand explicitly; the other source is implied and also serves as the destination
- A three-address architecture specifies two source operands explicitly, plus an explicit destination
- Variations allow additional index registers, base registers, multiple index registers, etc.

24 General-Purpose Register (GPR) Architecture (figure)

25 Stack Machine Architecture (SMA)
- AKA zero-address architecture, since arithmetic operations require no explicit operands, hence no operand addresses
- All operands are implied, except for push and pop
- What is the equivalent of push/pop on a GPR machine?
- A pure stack machine (SMA) has no registers
  - Hence performance would be poor, as all operations involve memory!
- However, one can design an SMA that implements the n top-of-stack elements as registers: a stack cache
- Sample architectures: Burroughs B5000, HP 3000

26 Stack Machine Architecture (SMA)
- Implement impure stack operations that bypass tos operand addressing
- Sample code sequence to compute on an SMA (operand sizes are implied; L stands for a literal operand whose value did not survive transcription):

    res := a * ( L + b )

    push a       -- destination implied: stack
    pushlit L    -- push literal; destination also implied
    push b       -- ditto
    add          -- 2 sources and destination implied
    mult         -- 2 sources and destination implied
    pop res      -- source implied: stack

27 Stack Machine Architecture (SMA) (figure)

28 Pipelined Architecture (PA)
- The Arithmetic Logic Unit (ALU) is split into separate, sequentially connected units in a PA
- Each unit is referred to as a "stage"; more precisely, the time slot in which the action is done is the stage
- Each of these stages/units can be initiated once per cycle
- Yet each subunit is implemented in HW just once
- Multiple subunits operate in parallel on different sub-ops, each executing a different stage; each stage is part of an instruction's execution

29 Pipelined Architecture (PA)
- Non-unit time, i.e. a differing number of cycles per operation, causes operations to terminate at different times
- Operations can abort in an intermediate stage if a later instruction changes the flow of control
  - E.g. due to a branch, exception, return, conditional branch, or call
- An operation must stall in case of operand dependence: a stall caused by an interlock, AKA a data or control dependency

30 Pipelined Architecture (PA) (figure)

31 Pipelined Architecture (PA)
- Ideally, each instruction can be partitioned into the same number of stages, i.e. sub-operations
- Operations to be pipelined can sometimes be evenly partitioned into equal-length sub-operations
- That equal-length time quantum might as well be a single sub-clock
- In practice this is hard for the architect to achieve; compare for example integer add and floating-point divide: vastly different time needs!

32 Pipelined Architecture (PA)
- Ideally, all operations have independent operands
  - I.e. an operand being computed is not needed as a source by the following few operations
- If it is needed, and often it is, this causes a dependence, which causes a stall (see the sketch after this list):
  1. read after write (RAW)
  2. write after read (WAR)
  3. write after write, with a use in between (WAW)
- Also, ideally, all instructions just happen to be arranged sequentially one after another in memory
- In reality there are branches, conditional branches, calls, returns, exceptions, etc.
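A minimal C sketch of this hazard classification, assuming a toy three-register instruction form; the struct and function names are invented for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int dst; int src1; int src2; } Insn;  /* register numbers */

    /* Classify how a 'later' instruction depends on an 'earlier' one. */
    bool raw(Insn e, Insn l) { return l.src1 == e.dst || l.src2 == e.dst; }
    bool war(Insn e, Insn l) { return l.dst == e.src1 || l.dst == e.src2; }
    bool waw(Insn e, Insn l) { return l.dst == e.dst; }

    int main(void) {
        Insn i1 = { 1, 2, 3 };   /* r1 = r2 op r3 */
        Insn i2 = { 4, 1, 5 };   /* r4 = r1 op r5: RAW on r1, so i2 must stall or forward */
        printf("RAW=%d WAR=%d WAW=%d\n", raw(i1, i2), war(i1, i2), waw(i1, i2));
        return 0;
    }

A real interlock unit performs exactly this comparison between the registers of in-flight instructions, in hardware and per stage.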

33 Pipelined Architecture (PA): Idealized Pipeline Resource Diagram (figure)
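Since the diagram is not reproduced, the following small C program prints an idealized resource diagram for a 4-stage pipeline, borrowing the IF/DE/EX/WB stage names introduced on slide 47; a sketch, not the original figure:

    #include <stdio.h>

    /* Print an idealized 4-stage pipeline diagram: one new instruction
       issues per cycle, no stalls, no hazards. */
    int main(void) {
        const char *stage[] = { "IF", "DE", "EX", "WB" };
        enum { STAGES = 4, INSNS = 5 };
        for (int i = 0; i < INSNS; i++) {
            printf("i%d: ", i + 1);
            for (int c = 0; c < i; c++) printf("    ");   /* cycles before issue */
            for (int s = 0; s < STAGES; s++) printf("%-4s", stage[s]);
            printf("\n");
        }
        return 0;
    }

Each row is one instruction, each column one cycle; after the pipeline fills, one instruction completes per cycle even though each instruction still takes four cycles.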

34 Multiprocessor (MP) Architectures
- Shared Memory Architecture (SMA)
- Equal access to memory for all n processors, p0 to pn-1
- If there are multiple, simultaneous accesses to shared memory, only one will succeed
- Simultaneous access must be resolved deterministically; this needs a policy, or an arbiter, that is deterministic (see the sketch below)
- The von Neumann bottleneck is even tighter than for a conventional UP system
- Typically there are ~twice as many loads as stores
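A hedged illustration in C of serialized access to one shared word, using POSIX threads and C11 atomics to stand in for the hardware arbiter; the thread count of 4 mirrors the typical n = 4 from the next slide (compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int shared_word;   /* one word in shared memory */

    /* Each "processor" increments the shared word; the atomic
       read-modify-write serializes simultaneous accesses, so exactly
       one wins at a time and no update is lost. */
    static void *processor(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            atomic_fetch_add(&shared_word, 1);
        return NULL;
    }

    int main(void) {
        pthread_t p[4];                                   /* n = 4 processors */
        for (int i = 0; i < 4; i++) pthread_create(&p[i], NULL, processor, NULL);
        for (int i = 0; i < 4; i++) pthread_join(p[i], NULL);
        printf("%d\n", atomic_load(&shared_word));        /* always 400000 */
        return 0;
    }

Without the atomic arbitration, simultaneous increments could overlap and the final count would be unpredictable; that is the determinism the slide demands.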

35 Multiprocessor (MP) Architectures
- Generally, some processors are idle due to memory or other conflicts
- Typical number of processors: n = 4, but n = 8 and greater is possible, with a large 2nd-level cache, and an even larger 3rd-level cache
- Early MP architectures had only limited commercial success and acceptance, due to the programming burden, frequently loaded onto the programmer
- Morphing in the 2000s into multi-core and hyperthreaded architectures, where the programming burden is on the multi-threading OS; i.e. the OS identifies and exploits the threads!

36 Multiprocessor (MP) Architectures (figure: yes, 3 CPUs, just to make the point of shared memory)

37 Distributed Memory Architecture (DMA)
- Processors have private, AKA local, memories
- Yet the programmer has to see a single, logical memory space, regardless of local distribution
- Hence each processor pi always has access to its own memory Memi
- And the collection of all memories Memi, i = 0..n-1, is the program's logical data space
- Thus, processors must access others' memories
- Done via message passing or virtual shared memory
- Messages must be routed, and the route determined
- A route may require multiple, intermediate nodes

38 Distributed Memory Architecture (DMA)
- Blocking when: a message is expected but hasn't arrived yet
- Blocking when: a message is to be sent, but the destination cannot receive
- Growing the message buffer size increases the illusion of asynchronicity of the sending and receiving operations
- Key parameters: the time for 1 hop, and the package overhead to send an empty message
- A message may also be delayed because of network congestion
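A sketch of blocking message passing between two nodes' private memories, written against the standard MPI C API; MPI is an assumption here, since the slides do not name a particular message-passing library:

    #include <mpi.h>
    #include <stdio.h>

    /* Run as: mpirun -np 2 ./a.out
       Node 0 sends one word out of its private memory; node 1 blocks
       until that word has arrived in its own private memory. */
    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* may block if unbuffered */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                      /* blocks until arrival */
            printf("node 1 received %d\n", x);
        }
        MPI_Finalize();
        return 0;
    }

The two blocking cases on the slide map directly to MPI_Recv with no message pending, and MPI_Send with no buffer space at the destination.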

39 Distributed Memory Architecture (DMA) (figure)

40 Systolic Array (SA) Architecture
- Very few were designed: by CMU and Intel for (then) ARPA
- Each processor has private memory
- The network is pre-defined by the systolic pathway (SP)
- Each node is pre-connected via the SP to some subset of the other processors
- Node connectivity is determined by the implemented/selected network topology
- The systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (received data are buffered)
- Typical network topologies: line, ring, torus, hex grid, mesh, etc.

41 Systolic Array (SA) Architecture
- The sample SA below is actually a ring: the wrap-around along the x and y directions is not fully shown
- A processor can write to an x or y gate; this sends the word off on the x or y SP
- A processor can read from an x or y gate; this consumes the word from the x or y SP
- A buffered SA can write to a gate even if the receiver cannot read
- An attempt to read from a gate when no message is available will cause blocking!
- Automatic code generation for a non-buffered SA is hard; the compiler must keep track of interprocessor synchronization
- One can view the SP as an extension of memory with infinite capacity, but with sequential access

42 Systolic Array (SA) Architecture (figure)

43 Systolic Array (SA) Architecture
- Note that each pathway, x or y, may be bi-directional
- An SA may have any number of pathways; there is nothing magic about the 2, x and y
- It is possible to have I/O capability at each node
- Typical application: evaluating large polynomials of the form (see the sketch below):
  y = k0 + k1*x + k2*x^2 + ... + k(n-1)*x^(n-1) = Σ ki*x^i
- The next example shows a torus without displaying the wrap-around pathways across both dimensions
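The per-cell computation for this polynomial application is typically Horner's rule; the C sketch below shows the recurrence, with one loop iteration standing in for one systolic cell (a sequential model, not the Warp code):

    #include <stdio.h>

    /* Horner's rule: y = k0 + x*(k1 + x*(k2 + ...)), evaluated from the
       highest coefficient down; each step is what one systolic cell
       applies to the value passing through it. */
    double horner(const double k[], int n, double x) {
        double y = 0.0;
        for (int i = n - 1; i >= 0; i--)   /* one step per cell */
            y = y * x + k[i];
        return y;
    }

    int main(void) {
        double k[] = { 1.0, 2.0, 3.0 };        /* y = 1 + 2x + 3x^2 */
        printf("%f\n", horner(k, 3, 2.0));     /* prints 17.000000 */
        return 0;
    }

In the array version, the coefficients are fixed in the cells and the partial result pulses along the systolic pathway, so n cells deliver one polynomial value per beat.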

44 Systolic Array (SA) Architecture (figure)

45 Hybrid Architectures
- Superscalar (SSA) Architecture
- Replicates (duplicates) some operation units in HW
- Looks like a scalar architecture w.r.t. the object code
- Offers a (limited type of) parallel execution, as it has multiple copies of some hardware units
- Is not an MP architecture: the multiple units do not have concurrent, independent memory access
- Has multiple ALUs, possibly multiple FP add (FPA) units, FP multiply (FPM) units, and/or integer units
- Arithmetic operations proceed simultaneously with load and store operations; note data dependences!

46 Hybrid Architectures
- Instruction fetch in a superscalar architecture is speculative, since the number of parallel operations is unknown; rule: fetch too much! But one cannot fetch more than the longest possible superscalar pattern
- The code sequence looks like a sequence of instructions for a scalar processor
- Example: code executed on Pentium processors
- A more famous and successful example: the processor discussed below
- Object code can be custom-tailored by the compiler; i.e. the compiler can have a superscalar target processor in mind and bias code emission, knowing that some code sequences are better suited for superscalar execution

47 Hybrid Architectures
- Fetch enough instruction bytes on a superscalar target to support the widest (most parallel) possible object sequence
- Decoding is the bottleneck for CISC; it is easier for RISC's 32-bit or 64-bit units
- Sample superscalar: the i80860 has separate FPA and FPM units, 2 integer ops, and load/store with pre- and post-address-increment and -decrement
- A superscalar, pipelined architecture with a maximum of 3 instructions per cycle
- In the abstract picture on the next page, the pipeline stages are IF, DE, EX, and WB, for instruction fetch, decode, execute, and write-back of results

48 Hybrid Architectures (figure: N = 3, i.e. 3 IPC)

49 VLIW Architecture (VLIW)
- Very Long Instruction Word, typically 128 bits or more
- Object code is no longer purely scalar but explicitly parallel, though the parallelism cannot always be exploited
- Just like the limitation in superscalars, this is not a general MP architecture: the subinstructions do not have concurrent memory access; dependences have to be resolved before code emission
- But VLIW opcodes are designed to support some parallel execution
- The compiler/programmer explicitly packs parallelizable operations into a VLIW instruction

50 VLIW Architecture (VLIW)
- Just like horizontal microcode compaction
- Other opcodes are still scalar and can coexist with VLIW instructions
- Partially parallel, even scalar, operation is possible by placing no-ops into some of the VLIW fields
- Sample: the Compute instruction of the CMU Warp and Intel iWarp
- There could be a 1-bit (or few-bit) opcode for the Compute instruction, plus sub-opcodes for the subinstructions
- Data dependence example: the result of the FPA cannot be used as an operand for the FPM in the same VLIW instruction

51 VLIW Architecture (VLIW)
- The result of int1 cannot be used as an operand for int2, etc.
- Thus, the need to software-pipeline
- Below: this is one VLIW instruction (figure; see the sketch that follows)
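A hedged C sketch of one VLIW instruction as a fixed set of sub-instruction slots; the slot and opcode names are invented for illustration and do not reflect the actual Warp/iWarp encoding:

    #include <stdio.h>

    /* Illustrative sub-opcodes; NOP fills unused slots for partially
       parallel, or even purely scalar, operation. */
    typedef enum { NOP, FPA_ADD, FPM_MUL, INT1_ADD, INT2_SUB, LOAD, STORE } SubOp;

    /* One very long instruction word: every slot issues in the same
       cycle, so no slot may consume a result that another slot of the
       same word produces. */
    typedef struct {
        SubOp fpa;    /* floating-point adder slot */
        SubOp fpm;    /* floating-point multiplier slot */
        SubOp int1;   /* integer unit slot */
        SubOp mem;    /* load/store slot */
    } VliwWord;

    int main(void) {
        /* Only two slots do useful work here; the rest are no-ops. */
        VliwWord w = { FPA_ADD, NOP, NOP, LOAD };
        printf("slots: %d %d %d %d\n", w.fpa, w.fpm, w.int1, w.mem);
        return 0;
    }

Packing these slots so that no same-cycle dependences remain is exactly the job the slides assign to the compiler or programmer.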

52 EPIC Architecture
- Groups instructions into bundles
- Straightens out branches by associating a predicate with instructions
- Executes instructions in parallel, say the else clause and the then clause of an if statement (see the C analogy below)
- Decides at run time which of the predicates is true, and keeps just that path of the multiple choices
- Uses speculation to straighten the branch tree
- Uses a large, rotating register file
- Has many registers, not just 64 GPRs
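A C analogy for this predication (if-conversion): both arms are computed and a predicate selects one result, eliminating the branch; this is only a software analogy for what EPIC does with predicate registers in hardware:

    #include <stdio.h>

    /* Branchy version: the pipeline must predict or stall at the if. */
    int abs_branchy(int x) { if (x < 0) return -x; return x; }

    /* Predicated version: then-value and else-value are both computed,
       and the predicate p selects one; no taken branch is needed. */
    int abs_predicated(int x) {
        int p = (x < 0);          /* predicate "register" */
        int t = -x;               /* then-clause result */
        int e = x;                /* else-clause result */
        return p ? t : e;         /* selection, not control flow */
    }

    int main(void) {
        printf("%d %d\n", abs_branchy(-5), abs_predicated(-5));   /* 5 5 */
        return 0;
    }

The wasted arm costs one execution slot, which EPIC has in abundance; what it buys is a straight-line instruction stream with no misprediction penalty.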

53 Summary: Computer Architecture
- Computers are never fast enough; just like people: never rich enough
- Speed improvements are accomplished through parallelism, multi-processing, pipelining, and resource replication
- Some modes of parallelism were dead ends, e.g. systolic arrays (controversial)
- Others offer solid improvement, e.g. pipelining, multi-processing, an adequate number of registers, multi-cores, etc.

54 Bibliography
lect11.pdf
8. VLIW Architecture: acrobat_download2/other/vliw-wp.pdf
9. ACM reference to the Multiflow computer architecture: id=110622&coll=portal&dl=acm
