EE 4980 Modern Electronic Systems: Advanced Processor Architecture
Architecture: General Purpose vs. Embedded Processors
General purpose processor:
- User programmable: intended to run end-user-selected programs
- Application independent: PowerPoint, Chrome, Twitter, Angry Birds
Embedded processor:
- Not user programmable: programmed by the manufacturer
- Application driven: non-smart phones, appliances, missiles, automobiles
- Very wide and very deep application profile
Architecture: General Purpose Processor - Key Characteristics
- 32/64-bit operations
- Support non-real-time / time-sharing operating systems
- Support complex memory systems: multi-level cache, DRAM, virtual memory
- Support DMA-driven I/O
- Complex CPU structures: pipelining, superscalar execution, out-of-order execution (OoO), floating-point hardware
Architecture: General Purpose Processor - Examples
- ARM 7, 9; Cortex-A8, A9, A15
- Intel Pentium, Core i-series; AMD Phenom, Athlon, Opteron
- Apple A4, A5
- TI OMAPs
Architecture: Embedded Processor - Key Characteristics
- 4/8/16/32-bit operations
- Support real-time operating systems
- Relatively simple memory systems; memory-mapped I/O
- Simple CPU structures: few registers, limited instructions
- Support for multiple I/O schemes; wide range of peripheral support (A/D, D/A, sensors)
- Extensive interrupt support
Architecture: Embedded Processor - Examples
- Motorola/Freescale 68K, HC11, HCS12
- ARM Cortex-R, Cortex-M
- Atmel AVR
Architecture: CISC vs. RISC
CISC (Complex Instruction Set Computer):
- The name didn't even exist until RISC was defined
- Used in most processors until about 1980
- One instruction holds multiple actions: load data from a location, add, write data to a new location
- Instructions were often designed to emulate high-level language constructs
RISC (Reduced Instruction Set Computer):
- Developed in the 1980s; the most prevalent architecture today
- Sometimes called a load/store architecture
- Instructions are simple: load data from a location, add, store data to a location
- RISC dominates today: it is much easier to take advantage of advanced structures like pipelining, superscalar execution, and OoO
Introduction: Processor Performance
- Performance improvement of 24,000x, with a frequency improvement of only 660x. How?
Source: Computer Architecture, Hennessy and Patterson, 2012, Elsevier Inc.
Introduction: Processor Performance
How? Faster transistors, larger die, pipelining, superscalar, OoO execution, SISD -> MIMD, memory hierarchy
- Moore's Law: 200,000x
- Performance improvement of 24,000x; frequency improvement of only 660x
Source: Computer Architecture, Hennessy and Patterson, 2012, Elsevier Inc.
Architecture: Memory Bus Structure
- von Neumann: a single unified memory holds both instructions and data, sharing one address/control bus to the ALU
- Harvard: separate instruction memory and data memory, each with its own address and control buses
(Diagram: ALU with control/status signals connected to a unified memory vs. separate instruction and data memories)
Architecture: Memory Bus Structure - Modified Harvard
- A unified main memory with separate instruction and data paths to the CPU
(Diagram: unified memory feeding separate instruction and data memories, connected to the ALU)
Architecture: Cache Memory - Modified Harvard
- The instruction and data memories are often augmented by cache memories, or are caches themselves
(Diagram: unified memory backing separate instruction and data memories, connected to the ALU)
Architecture: Instruction / Data Structures (Flynn's taxonomy)
- SISD: Single Instruction, Single Data - one processor, one instruction stream, one data stream
- SIMD: Single Instruction, Multiple Data - one instruction stream applied across multiple data streams
- MISD: Multiple Instruction, Single Data - multiple instruction streams operating on one data stream
- MIMD: Multiple Instruction, Multiple Data - multiple processors, each with its own instruction and data streams
Architecture: Cache Memory
- Cache memory stores relatively small amounts of data or program for a relatively short amount of time
- Caches sit between the processor and main memory
- Fast: keeping them small makes them fast, allowing the processor to run faster than main memory alone would allow
- They leverage temporal locality: if you have recently used a piece of data, you are likely to use it again
- They leverage spatial locality: program code and data structures are generally contiguous in memory
Architecture: Cache Memory - Basic Operation
- The processor requests a byte of program or data
- The system first checks whether the byte is already in the cache:
  - If yes: read the byte and continue (a cache hit)
  - If no: stall, or allow the processor to do something else (a cache miss); read the byte from main memory into the cache, then read it from the cache and continue
- If the cache is full and a new byte needs to be loaded, several replacement policies can be used:
  - LRU: the least recently used byte is removed
  - FIFO: the oldest loaded byte is removed
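The lookup flow above can be sketched in software (a toy illustration only: `ToyCache` and its byte-granularity are my own simplifications; real caches track multi-byte lines and do all of this in hardware):

```python
from collections import OrderedDict

class ToyCache:
    """Tiny fully-associative cache with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> data, least recently used first
        self.hits = self.misses = 0

    def read(self, address, main_memory):
        if address in self.lines:                # cache hit: data already present
            self.hits += 1
            self.lines.move_to_end(address)      # mark as most recently used
        else:                                    # cache miss: fetch from memory
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict the least recently used entry
            self.lines[address] = main_memory[address]
        return self.lines[address]

mem = {a: a * 10 for a in range(8)}              # stand-in for main memory
c = ToyCache(capacity=2)
c.read(0, mem); c.read(1, mem)                   # two compulsory misses
c.read(0, mem)                                   # hit: temporal locality pays off
c.read(2, mem)                                   # miss: evicts address 1 (the LRU entry)
```

After this sequence the cache holds addresses 0 and 2, with one hit and three misses, matching the hit/miss flow described above.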
Architecture: Pipelining
- Execute = fetch instruction, decode, execute, write back
- No pipeline: each instruction (A, B, C, D) runs to completion before the next starts; one instruction retires every 4 us
(Diagram: clock cycles 0-5, instructions A-D waiting, executing one at a time, then retiring)
Architecture: Pipelining
- Break complex tasks into smaller chunks
- Start the next instruction as soon as each subtask is complete
- With a 4-stage pipeline (fetch, decode, execute, write back), one instruction retires every 1 us once the pipeline is full
(Diagram: clock cycles 0-8, instructions A-D overlapping in the fetch, decode, execute, and write-back stages)
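The timing in the diagrams above follows a simple formula (a sketch assuming an ideal pipeline with no hazards; the function names are my own): n instructions in a k-stage pipeline finish in k + (n - 1) cycles, versus n * k cycles serially.

```python
def pipelined_cycles(n_instructions, n_stages):
    """Cycles to retire n instructions in an ideal pipeline: fill + one per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, n_stages):
    """Without pipelining, every instruction occupies all stages serially."""
    return n_instructions * n_stages

# The 4-stage example above: instructions A-D finish in 7 cycles
# pipelined, versus 16 cycles without pipelining.
```

For long instruction streams the fill time becomes negligible and throughput approaches one instruction per cycle.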
Pipelining: Simple Datapath
Pipelining: 5 Stages of Instruction Execution
- Fetch (IF)
- Decode / Register Access (ID)
- Execute (EX)
- Memory Access (MEM)
- Write Back (WB)
Pipeline these at 1 stage each.
Pipelining: Pipeline Performance
- Pipelining does not reduce the time to execute a single instruction; in fact it usually increases the instruction execution time
- Pipelining does increase instruction throughput
(Diagram: non-pipelined, one instruction completes IF/ID/EX/MEM/WB every 1000 time units; pipelined, 200 units per stage, with a new instruction entering IF and one retiring from WB every 200 units once the pipeline is full)
Pipelining: Pipeline Performance
- Non-pipelined: 1M instructions take 1x10^9 units of time
- Pipelined (5 stages): 1M instructions take about 2x10^8 units of time
- Overall throughput improvement of 5x
Pipelining: Pipeline Performance
- Non-pipelined: 1M instructions take 1x10^9 units of time
- Pipelined (5 stages with a 20% penalty per stage): 1M instructions take about 2.2x10^8 units of time
- Overall throughput improvement of 4.5x
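These throughput figures can be checked numerically (a sketch: `pipeline_speedup` is my own helper, and the 220-unit stage time is the per-stage overhead value that reproduces the ~2.2x10^8-unit total quoted above, in the slide's abstract time units):

```python
def pipeline_speedup(n_instructions, n_stages, stage_time, serial_time):
    """Ratio of total serial time to total pipelined time."""
    pipelined = (n_stages - 1 + n_instructions) * stage_time  # fill, then 1/cycle
    serial = n_instructions * serial_time
    return serial / pipelined

# Ideal 5-stage pipeline: 200-unit stages vs. 1000 units per instruction serially
ideal = pipeline_speedup(1_000_000, 5, 200, 1000)    # ~5x

# Per-stage overhead slows the clock: 220-unit stages give a total of
# ~2.2e8 units and a throughput improvement of about 4.5x
slower = pipeline_speedup(1_000_000, 5, 220, 1000)
```

Any latch/setup overhead added to every stage lengthens the cycle time and eats directly into the ideal n-stage speedup.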
Pipelining: Pipeline Performance
- Pipeline stages typically do not all take the same amount of time:

  Stage   IF      ID/RR   EX      MEM     WB
  Delay   200ps   100ps   200ps   200ps   100ps

- Non-pipelined instruction throughput = 1 inst / 800ps
- Pipelined (5-stage) instruction throughput = 1 inst / 200ps (the clock is set by the slowest stage)
- Overall throughput improvement of 4x
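A quick check of the unequal-stage numbers above (the serial design takes the sum of the stage delays per instruction, while the pipelined design is clocked at the slowest stage):

```python
# Stage delays from the table above, in picoseconds
stage_delays_ps = {"IF": 200, "ID/RR": 100, "EX": 200, "MEM": 200, "WB": 100}

unpipelined_period = sum(stage_delays_ps.values())   # 800 ps per instruction
pipelined_period = max(stage_delays_ps.values())     # clock set by slowest stage: 200 ps
speedup = unpipelined_period / pipelined_period      # 4x, not the ideal 5x
```

The short ID/RR and WB stages sit idle part of each cycle, which is why the improvement falls short of the ideal 5x for a 5-stage pipeline.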
Pipelining: Data Hazards
- These hazards result from a dependence of one instruction on another instruction still in the pipeline
- Consider the following code snippet:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
- The value of $s0 is needed to perform the subtraction
Pipelining: Data Hazards
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
(Diagram: the add proceeds through IF/ID/EX/MEM/WB; the sub stalls for 2 cycles in ID, inserting bubbles into EX/MEM/WB)
- 2 clock-cycle bubbles are created
- It would be 3 bubbles, except we can take advantage of our convention:
  - writes occur in the first half of the clock cycle
  - reads occur in the second half of the clock cycle
  - so the WB can occur during the same clock cycle as the register read
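The two-bubble count can be derived mechanically (a sketch assuming the write-first-half / read-second-half convention and no forwarding; `raw_stalls` is my own illustration, not a standard routine):

```python
def raw_stalls(producer_issue, consumer_issue):
    """Stall cycles for a consumer reading a register the producer writes.

    5-stage pipeline, stage offsets from issue: IF=0, ID=1, EX=2, MEM=3, WB=4.
    Writes land in the first half-cycle and reads in the second, so the
    consumer may decode in the same cycle as the producer's write-back.
    """
    producer_wb = producer_issue + 4      # cycle the result is written back
    consumer_id = consumer_issue + 1      # cycle the operand would be read
    return max(0, producer_wb - consumer_id)

# add issued in cycle 0, sub immediately after in cycle 1: 2 bubbles,
# not 3, because sub's ID can share the cycle with add's WB.
```

Issuing the dependent instruction three or more cycles later makes the stall disappear entirely, which is why compilers try to schedule independent instructions into the gap.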
Pipelining: Control Hazards
- These hazards result from making a decision while other instructions continue to progress through the pipeline
- Branch instructions are the most common example: we don't know whether to load the next instruction or not
- Three approaches: stall, predict, delay
Pipelining: Control Hazards - Stall
- Do not load the next instruction into the pipeline
(Diagram: the beq stalls the pipeline for 2 cycles, inserting bubbles, before instruction 8 is fetched)
- During decode, we know we have a branch
- During execute, we know whether the branch is taken, and the PC is updated
- Next cycle: fetch the next instruction based on the PC value
Pipelining: Control Hazards - Stall
- Even if you add circuitry to detect the branch and update the PC entirely during decode, you can't avoid a stall: the instruction after the branch has already been fetched
Pipelining: Control Hazards - Predict
- Many algorithms exist
- Simplest: assume the branch will not be taken
  - no penalty when the prediction is correct
  - stall only when the prediction is wrong
Pipelining: Control Hazards - Predict
Predict branch not taken:
- Branch not taken: prediction correct, no penalty
- Branch taken: prediction wrong, and the wrongly fetched instructions must be discarded
(Diagram: pipeline flow for the correct and incorrect prediction cases)
Pipelining: Control Hazards - Predict
Static branch prediction:
- Predict backward branches as taken; predict forward branches as not taken
- Rationale: looping code executes the loop 100 times but jumps out of the loop only 1 time
Dynamic branch prediction:
- Keep track of recent branch behavior (for each branch)
- Assume recent behavior will continue
- When wrong, clear the history and start over
- Hardware intensive
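The "assume recent behavior will continue" idea can be sketched as a one-bit predictor (a minimal illustration of the concept, not any specific hardware scheme; real predictors typically use per-branch 2-bit saturating counters):

```python
def one_bit_predict(outcomes, initial="T"):
    """Predict each branch outcome as whatever the branch did last time.

    outcomes: sequence of "T" (taken) / "N" (not taken).
    Returns the number of mispredictions.
    """
    prediction, mispredicts = initial, 0
    for outcome in outcomes:
        if outcome != prediction:
            mispredicts += 1
        prediction = outcome          # remember the most recent behavior
    return mispredicts

# The looping example above: a branch taken 100 times, then exited once,
# with the loop re-entered 3 times. Mispredictions occur only at each
# loop exit and at the first iteration after each exit.
history = (["T"] * 100 + ["N"]) * 3
```

On this pattern the predictor is wrong only 5 times out of 303 branches, showing why tracking recent behavior beats a fixed guess on loop-heavy code.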
Pipelining
- Mapping the datapath onto a pipeline creates both control hazards and data hazards
Pipelining: Pipeline Control
Architecture: Superscalar - parallelism at the micro-architecture level
Introduction: Processor Architecture
Architecture: Modern Example