CS 2461: Computer Architecture 1 Program performance and High Performance Processors


1 CS 2461: Program Performance and High Performance Processors
Instructor: Prof. Bhagi Narahari

Course Objectives: Where are we?
Bits & bytes: logic devices, HW building blocks.
Processor: ISA, datapath. Using building blocks to assemble a processor (LC3).
Programming the processor: assembly.
Translating high-level programs to the processor: implementing C on LC3.

Bits & Bytes to High-Level Programs
A user application is written in a high-level language, and the program runs on a processor.
How are high-level programs implemented on a processor? Run-time stack, allocation of variables, translation of high-level code to machine code.
Map high-level data structures to low-level data structures: struct to linear mapping in memory.

Next
What else does a software developer want after the program is implemented correctly? PERFORMANCE!
Performance of programs: what to measure, and how to model it?
Technology trends and real processors: how to improve performance. Pipelining, ILP, multi-core.
Memory organization basics. Memory hierarchy: cache, main memory, etc.
How to rewrite your program to make it run faster: code optimization.

2 Performance of Programs
Complexity of algorithms: how good/efficient is your algorithm? Measure using Big-Oh notation, e.g. O(N log N).
Next question: how well is the code executing on the machine?
Actual time to run the program. What are the factors that come into play?
Where are the program and data stored?
What are the actual machine instructions executed?
Why is some HW better than others for different programs?
What factors of system performance are HW related?
How does the machine instruction set affect performance?
What are the technology trends and how do they play a role?

Program Performance: The Great Reality
Our focus: there's more to performance than asymptotic complexity.
Must optimize at multiple levels: algorithm, data representations, procedures, and loops.
Must understand the system to optimize performance: how programs are compiled and executed, how data is stored, what data structures are used.
How to measure program performance and identify bottlenecks.
How to improve performance without destroying code modularity and generality.

Technology Trends & Performance
Speed depends on the clock cycle (frequency) of the circuits.
How fast can we switch the transistors? Feed the signal to the gate of a MOS transistor: how long does the transistor take to throw the switch?
How large is the transistor feature size?

Moore's Law
A founder of Intel hypothesized on the rate of increase in performance.
It is not a law in the sense of the laws of physics, etc.
Observation: performance doubles every 18 months.
If you knew this, how would it guide your business decisions? Case study: Apple Computers in '85.

Delay vs. Feature Size
[Figure: delay (ps) versus feature size (nm), with curves for gate delay, interconnect delay (Cu & low-k), and interconnect delay (Al & SiO).]
Source: Bohr, M. T., "Interconnect Scaling - The Real Limiter to High Performance ULSI", Proceedings of the IEEE International Electron Devices, pages

3 Transistors / Chip
[Figure: MPU transistors per chip (M) and DRAM bits per chip (G), plotted by year.]

Memory Capacity (Single Chip DRAM)
[Table: year, size (Mb), cycle time (ns). The values were lost in extraction.]

The CPU-Memory Gap
[Figure: the increasing gap between DRAM, disk, and CPU speeds. Time (ns) by year for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]

Performance Trends: Summary
Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months).
Performance will include not just the processor, but also memory and disk I/O.
Improvement in cost performance is estimated at 70% per year.

4 Performance: What to Measure?
Which of these airplanes has the best performance?

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

The bottom line: the performance metric depends on the application.
Time to run the task (execution time / response time / latency): the time to travel from DC to Paris.
Tasks per unit time (throughput / bandwidth): passenger-miles per hour (pmph), i.e., how many passengers are transported per unit time.

Computer Performance: TIME, TIME, TIME
Response time (latency):
  How long does it take for my job to run?
  How long does it take to execute a job?
  How long must I wait for the database query?
Throughput:
  How many jobs can the machine run at once?
  What is the average execution rate?
  How much work is getting done?
The metric chosen usually depends on the user community: sys admin vs. single user?
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine to the lab, what do we increase?
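The airplane comparison can be checked with a few lines of Python. This is only a sketch: the dictionary layout and function name are illustrative choices, while the numbers come from the slide's table.

```python
# Latency vs. throughput for the two planes on the slide.
# The data layout and names here are illustrative, not from the course.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour = passengers * speed (mph)."""
    return plane["passengers"] * plane["mph"]


# Lowest latency: the plane with the shortest DC-to-Paris trip time.
fastest = min(planes, key=lambda name: planes[name]["hours"])

# Highest throughput: the plane moving the most passenger-miles per hour.
highest_throughput = max(planes, key=lambda name: throughput_pmph(planes[name]))

print(fastest, highest_throughput)
```

The two metrics pick different winners, which is the point of the slide: the Concorde has the better latency, the 747 the better throughput.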

5 Execution Time
Elapsed time counts everything (disk and memory accesses, I/O, etc.). A useful number, but often not good for comparison purposes.
CPU time doesn't count I/O or time spent running other programs. It can be broken up into system time and user time.
Our focus in this course: user CPU time, the time spent executing the lines of code that are "in" our program.

How to Model Performance
The asymptotic complexity, big O: Time = O(f(n)), a function of the size of the input. Sorting: O(n log n).
This measures the efficiency of your algorithm, i.e., how good the solution technique is.
Is this enough when we talk of actual time measured on the processor?
There's more to performance than asymptotic complexity.
Must optimize at multiple levels: algorithm, data representations, procedures, and loops.
Must understand the system to optimize performance: how programs are compiled and executed, data storage, data structures, I/O management.

Processor time: how to measure?
The number of clock cycles it takes to complete the execution of your program.
What is your program? A number of instructions of different types (load, store, ..., branch), stored in memory and executed on the CPU.

Aspects of CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
CPU time = IC * CPI * Clk, where IC is the instruction count, CPI is the average cycles per instruction, and Clk is the clock cycle time.
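As a quick illustration of the CPU time equation, the sketch below multiplies the three factors. The function name and the sample numbers (one billion instructions, CPI of 2, 1 ns cycle) are made up for the example and do not come from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_period_s):
    """CPU time = IC * CPI * Clk (instructions * cycles/instruction * seconds/cycle)."""
    return instruction_count * cpi * clock_period_s


# Hypothetical program: 1e9 instructions, average CPI of 2.0, 1 ns clock period.
example = cpu_time_seconds(1_000_000_000, 2.0, 1e-9)
print(f"{example:.3f} s")

# Halving any one factor halves the CPU time.
faster_clock = cpu_time_seconds(1_000_000_000, 2.0, 0.5e-9)
print(f"{faster_clock:.3f} s")
```

The equation makes the three levers explicit: fewer instructions (compiler, ISA), fewer cycles per instruction (organization), or a shorter cycle (technology).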

6 CPI: Cycles per Instruction
Different instructions may take different times.
Example in LC3? We observed that not every instruction needs to go through all the instruction execution steps. E.g., no need to calculate an effective address, or to fetch from memory or registers.
Reality: different times are associated with different operations. This is especially true of memory operations.

Average CPI
An application has an instruction mix: a profile of the application's instruction types (Load/Store (memory), Branch, Jumps, etc.), given as fractions x_1, x_2, x_3, ... (e.g., x_1 = 0.4).
A processor has a CPI for each type of instruction, t_1, t_2, t_3, ... This is part of the ISA of a processor (specifications document). Example: ... = 1.0, Load/Store = 2.0, etc.
What is the effective CPI? The weighted average:
$$\text{CPI} = x_1 t_1 + x_2 t_2 + \cdots$$

CPI: Cycles per Instruction
Depends on the instruction. Use the average cycles per instruction.

Principles of Computer Architecture Design: Thumb Rules
Common case fast: focus on improving those instructions that are frequently used.
Amdahl's Law: the fraction that is enhanced/optimized runs faster.
Principle of locality: a program spends 90% of its time in 10% of its code. E.g., word processing.
  Spatial: items near each other tend to be accessed.
  Temporal: recently used items tend to be used again.
Concurrency/parallelism: overlap the instruction execution steps. Pipelined processors, multi-core processors.
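The weighted-average CPI can be sketched as follows. The mix used in the example is hypothetical: it borrows x_1 = 0.4 and Load/Store = 2.0 from the slide and assumes the remaining 60% of instructions take 1.0 cycle each.

```python
def effective_cpi(mix):
    """Weighted average CPI.

    mix is a list of (fraction, cycles) pairs, one per instruction type.
    The fractions are expected to sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix)


# Assumed mix: 40% load/store at 2.0 cycles, 60% other instructions at 1.0 cycle.
assumed_mix = [(0.4, 2.0), (0.6, 1.0)]
print(round(effective_cpi(assumed_mix), 2))
```

With this assumed mix the effective CPI is about 1.4, so memory instructions account for more than half of the cycles even though they are less than half of the instructions.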


7 Amdahl's Law: Speedup
An application takes X time. How do we run it faster?
Enhance/optimize a portion of it. Which portion? Can we enhance all of it?
Note that we are talking of solving the enhanced part in a different way, possibly using different (more costly) resources.

Where to focus our optimizations?
Look at the return on investment: code segments that take a long time can give us the best returns.
Profile your code to understand which parts are dominating.

Improving Performance of Processors: Quick Review
Are real processors like LC3?
How can we improve the performance of the processor? What design principles?
A quick overview of techniques used in real processor designs:
  Pipelining
  Instruction-level parallel (ILP) processors
  Multithreaded processors, multi-core

Real-World Pipelines: Car Washes
[Figure: car washes arranged as sequential, parallel, and pipelined.]
Idea: divide the process into independent stages. Move objects through the stages in sequence. At any given time, multiple objects are being processed.

Instruction Pipeline
The instruction execution process lends itself naturally to pipelining: overlap the subtasks of instruction fetch, decode, and execute.
Stages: Inst Fetch, Decode, Execute, Mem Access, Write Back Result.
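The Amdahl's Law slide states the idea without a formula. The sketch below uses the standard textbook formulation, speedup = 1 / ((1 - f) + f / s), where f is the fraction of the original time that is enhanced and s is the speedup of that fraction; the function name and sample values are assumptions for illustration.

```python
def amdahl_speedup(fraction_enhanced, enhancement_speedup):
    """Overall speedup when only part of the execution time is sped up.

    fraction_enhanced: share of the original run time that benefits (0..1).
    enhancement_speedup: how much faster that share runs (> 0).
    """
    remaining = 1.0 - fraction_enhanced
    return 1.0 / (remaining + fraction_enhanced / enhancement_speedup)


# Hypothetical cases: speeding up half the program 2x, and 90% of it 10x.
print(round(amdahl_speedup(0.5, 2.0), 3))
print(round(amdahl_speedup(0.9, 10.0), 3))
```

Even a 10x enhancement of 90% of the run time yields only a little over 5x overall, which is why profiling to find the dominant code segments matters.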

8 Conventional Pipelined Execution Representation
[Figure: pipeline timing diagram. Time runs left to right and program flow top to bottom; each successive instruction passes through IFetch, Dcd, Exec, Mem, WB, starting one cycle after the previous instruction.]

Speedup of Pipelines
Suppose we have a k-stage pipeline and n tasks (instructions) to process.
When does the first instruction complete? After k cycles.
When does the 2nd instruction complete? In the next cycle: k + 1 cycles to complete 2 tasks.
How long to complete n tasks? After the first task, we get one output every cycle for the next (n - 1) cycles. Therefore
$$T_k = k + (n - 1) \text{ cycles}$$
Time for the non-pipelined case: $T_1 = nk$ cycles.
Therefore the speedup using a k-stage pipeline is
$$S_k = \frac{T_1}{T_k} = \frac{nk}{k + n - 1}$$
For large n, this is approximately nk/n = k.
Challenges:
  Data hazards: solved by internal forwarding.
  Control hazards (branches): prediction helps, but they are still a problem.

Example
Suppose we execute 100 instructions.
Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle machine: 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
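The pipeline timing formulas can be sketched directly. The function names are illustrative; the 5-stage, 100-instruction, 10 ns/cycle case mirrors the ideal pipelined machine in the slide's example, where the 4 drain cycles correspond to k - 1.

```python
def pipelined_cycles(k, n):
    """Cycles for n instructions on an ideal k-stage pipeline: k + (n - 1)."""
    return k + (n - 1)


def unpipelined_cycles(k, n):
    """Cycles when each instruction takes all k steps before the next starts."""
    return n * k


def pipeline_speedup(k, n):
    """S_k = T_1 / T_k = nk / (k + n - 1)."""
    return unpipelined_cycles(k, n) / pipelined_cycles(k, n)


cycles = pipelined_cycles(5, 100)      # 100 instructions plus 4 drain cycles
print(cycles, "cycles,", cycles * 10, "ns at 10 ns/cycle")
print(round(pipeline_speedup(5, 100), 2))
print(round(pipeline_speedup(5, 1_000_000), 4))  # approaches k for large n
```

For short instruction streams the fill and drain cycles are visible in the speedup; as n grows, the speedup approaches the number of stages.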

9 So How Hard Is It to Design a Pipelined Processor?
Go back and examine your datapath and control diagram:
  associate resources with states;
  ensure that flows do not conflict, or figure out how to resolve the conflicts;
  assert control in the appropriate stage.

Sample Datapath
What do we need to do to pipeline the process?
[Figure: sample datapath with five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components shown include the next-PC adder, instruction memory, register file, sign extend, MUXes, zero test, and data memory.]

5 Steps of Datapath
[Figure: the same datapath with pipeline latches IF/ID, ID/EX, EX/MEM, and MEM/WB between the five stages.]

Visualizing Pipelining
[Figure: instructions in program order against time in clock cycles (Cycle 1 to Cycle 7).]

10 Can't Be That Easy... Problems?
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle, and they introduce stall cycles which increase CPI.
  Structural hazards: the HW cannot support this combination of instructions (two dogs fighting for the same bone).
  Data hazards: an instruction depends on the result of a prior instruction still in the pipeline. Data dependencies.
  Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). Control dependencies.
Can always resolve hazards by stalling.
But more stall cycles = more CPU time = less performance. To increase performance, decrease stall cycles.

Back to Our Old Friend: CPU Time Equation
Recall the equation for CPU time.
So what are we doing by pipelining the instruction execution process? Clock? Instruction count? CPI?
How is CPI affected by the various hazards?

Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$
More stalls means lower performance!

One Memory Port / Structural Hazards
[Figure: pipeline diagram over cycles 1 to 7 for Load, Inst 1, Inst 2, Inst 3, Inst 4, in program order, sharing one memory port.]
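The speedup equation for pipelining can be sketched as below, assuming an ideal CPI of 1 as in the formula. The function name, the default cycle times, and the sample stall rates are assumptions for illustration.

```python
def pipelining_speedup(depth, stall_cpi, cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined cycle / pipelined cycle).

    Assumes an ideal CPI of 1. stall_cpi is the average number of stall
    cycles per instruction caused by hazards.
    """
    return depth / (1.0 + stall_cpi) * (cycle_unpipelined / cycle_pipelined)


# Hypothetical 5-stage pipeline with equal cycle times.
print(pipelining_speedup(5, 0.0))    # no stalls: speedup equals the depth
print(pipelining_speedup(5, 0.25))   # one stall every four instructions
print(pipelining_speedup(5, 1.0))    # one stall per instruction
```

Each additional stall cycle per instruction eats directly into the ideal speedup, which is why forwarding and branch prediction are worth the hardware.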

11 One Memory Port / Structural Hazards
[Figure: pipeline diagram over cycles 1 to 7 for Load, Inst 1, Inst 2, Stall, Inst 3, in program order; the stall inserts a row of bubbles into the pipeline.]

Data Dependencies
True dependencies and false dependencies:
  "False" implies we can remove the dependency, i.e., the compiler can remove it.
  "True" implies we are stuck with it!
Three types of data dependencies, defined in terms of how the succeeding instruction depends on the preceding instruction:
  RAW: Read after Write, or flow dependency
  WAR: Write after Read, or anti-dependency
  WAW: Write after Write

Data Hazards
Read After Write (RAW): Inst J tries to read an operand before Inst I writes it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

False Data Hazards
Write After Read (WAR): Inst J tries to write an operand before Inst I reads it.
  I: add r1,r2,r3
  J: mul r2,r5,r6
Caused by a register dependence (in compiler nomenclature). It can be removed at compile time by assigning different registers.
Assign a different register to the output of instruction J: mul r7,r5,r6
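The three dependency types can be detected mechanically from the registers each instruction reads and writes. The sketch below is an illustration, not the course's code: the (destination, sources) tuple representation and the function name are assumptions, and register names are written as r1, r2, and so on.

```python
def classify_dependencies(first, second):
    """Return the dependency types of `second` on the earlier `first`.

    Each instruction is a (destination_register, source_registers) pair.
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    found = []
    if first_dest in second_srcs:
        found.append("RAW")  # second reads what first writes (true dependency)
    if second_dest in first_srcs:
        found.append("WAR")  # second overwrites what first reads (anti-dependency)
    if second_dest == first_dest:
        found.append("WAW")  # both write the same register
    return found


add_i = ("r1", ("r2", "r3"))          # add r1,r2,r3
sub_j = ("r4", ("r1", "r3"))          # sub r4,r1,r3
mul_j = ("r2", ("r5", "r6"))          # mul r2,r5,r6
mul_j_renamed = ("r7", ("r5", "r6"))  # mul r7,r5,r6 after renaming

print(classify_dependencies(add_i, sub_j))
print(classify_dependencies(add_i, mul_j))
print(classify_dependencies(add_i, mul_j_renamed))
```

Renaming the destination of the `mul` removes the WAR dependency entirely, while no renaming can remove the RAW dependency between the `add` and the `sub`.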

12 Internal Forwarding: Getting Rid of Some Hazards
In some cases the data needed by the next instruction has already been computed by an earlier stage but has not yet been written back to the registers.
Can we forward this result by bypassing stages?

Data Hazard on R1
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over clock cycles for the sequence below, in program order.]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

Forwarding to Avoid Data Hazard
[Figure: the same instruction sequence over clock cycles, with results forwarded between pipeline stages.]

Control Hazards: Branches
Instruction flow:
  The stream of instructions is processed by Inst. Fetch.
  The speed of the input flow puts a bound on the rate of outputs generated.
A branch instruction affects instruction flow: we do not know the next instruction to be executed until the branch outcome is known.
When we hit a branch instruction:
  Need to compute the target address (where to branch).
  Need to resolve the branch condition (true or false).
  Might need to flush the pipeline if other instructions have been fetched for execution.

13 Control Hazard on Branches: Three-Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11

Solution?
Branch prediction algorithms, implemented in hardware.
Use history to predict a branch.
Example: a for loop. The branch is always taken except for the last iteration.

Is this how real processors look?
NO, there is more stuff. Pipelining is the first step. Next is instruction-level parallelism (ILP).
What if we had many pipeline units?
ILP is transparent to the user: multiple operations are executed in parallel even though the system is handed a single program written with a sequential processor in mind.
Same execution hardware as a normal RISC machine, but there may be more than one of any given type of hardware.

Architectures for ILP
Scalar pipeline (baseline):
  Instruction parallelism = D
  Operation latency = 1
  Peak IPC = 1 (IPC: instructions per cycle)
[Figure: successive instructions passing through IF, DE, EX, WB against time in cycles of the baseline machine.]
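The slide does not name a specific prediction algorithm. As one simple illustration of "use history to predict a branch", the sketch below simulates a last-outcome (1-bit) predictor on a loop branch that is taken on every iteration except the last. The function name, the initial prediction, and the 10-iteration loop are assumptions for the example.

```python
def count_correct_predictions(outcomes, initial_prediction=False):
    """Simulate a last-outcome predictor for a single branch.

    outcomes: list of booleans, True when the branch was actually taken.
    The predictor guesses that the branch repeats its previous outcome.
    """
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent outcome
    return correct


# Loop branch for a hypothetical 10-iteration for loop:
# taken 9 times, then not taken on the final iteration.
loop_outcomes = [True] * 9 + [False]
print(count_correct_predictions(loop_outcomes), "of", len(loop_outcomes))
```

The predictor misses only on the first iteration (no history yet) and on the loop exit, so longer loops are predicted almost perfectly.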

14 Superscalar Processors
Superscalar (pipelined) execution:
  IP = D x N
  OL = 1 baseline cycle
  Peak IPC = N per baseline cycle
[Figure: N instructions per cycle passing through IF, DE, EX, WB.]
Is it that simple? Opportunities (to speed things up) are greater, but the problems become more challenging.

So can SW do anything about the problems?
This is where you get to claim SW folks are smarter than HW folks!
The compiler can look at the entire code:
  Analyze dependencies at compile time.
  Rewrite code.
  Rearrange instructions to improve parallelism.
  Make better use of registers.
These are all things that modern compilers do by default!

Example
  1. ADD r1, r2, r3      {1,2,3} are dependent on each other: sequential
  2. MUL r4, r1, r2
  3. ADD r2, r4, r3
  4. MUL r10, r11, r12   {4,5,6} are dependent on each other: sequential
  5. ADD r14, r10, r11
  6. SUB r15, r14, r12
No parallelism in the code when parsing sequentially.

15 Example
  1. ADD r1, r2, r3      {1,2,3} are dependent on each other: sequential
  2. MUL r4, r1, r2
  3. ADD r2, r4, r3
  4. MUL r10, r11, r12   {4,5,6} are dependent on each other: sequential
  5. ADD r14, r10, r11
  6. SUB r15, r14, r12
As groups, {1,2,3} and {4,5,6} are not dependent on each other. Therefore we can reorder:
  1. ADD r1, r2, r3      {1,2} are independent
  2. MUL r10, r11, r12
  3. MUL r4, r1, r2      {3,4} are independent
  4. ADD r14, r10, r11
  5. ADD r2, r4, r3      {5,6} are independent
  6. SUB r15, r14, r12
Now we have parallelism in the code.

Multithreaded Processing
So are ILP processors the real thing? NO! There are even more techniques.
Have you written programs with multiple threads (in Java)?
Question: can we run threads in parallel?
Now we enter the realm of multi-core processors.
Problems become even more challenging, but opportunities for performance improvement explode!
[Figure: fine-grain and coarse-grain multithreading; legend: Thread 1 to Thread 5, idle slot.]
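The regrouping in the example can be reproduced with a small dependency analysis. The sketch below is an illustration under stated assumptions: instructions are (opcode, destination, sources) tuples with registers written as r1, r2, and so on, any RAW, WAR, or WAW conflict forces the later instruction into a later group, and issue width and latencies are ignored.

```python
program = [
    ("ADD", "r1", ("r2", "r3")),
    ("MUL", "r4", ("r1", "r2")),
    ("ADD", "r2", ("r4", "r3")),
    ("MUL", "r10", ("r11", "r12")),
    ("ADD", "r14", ("r10", "r11")),
    ("SUB", "r15", ("r14", "r12")),
]


def depends_on(later, earlier):
    """True if `later` has a RAW, WAR, or WAW dependency on `earlier`."""
    _, later_dest, later_srcs = later
    _, earlier_dest, earlier_srcs = earlier
    return (
        earlier_dest in later_srcs     # RAW
        or later_dest in earlier_srcs  # WAR
        or later_dest == earlier_dest  # WAW
    )


def parallel_groups(instructions):
    """Group 1-based instruction numbers by the earliest step they can run in."""
    levels = []
    for i, instr in enumerate(instructions):
        level = 0
        for j in range(i):
            if depends_on(instr, instructions[j]):
                level = max(level, levels[j] + 1)
        levels.append(level)
    groups = {}
    for number, level in enumerate(levels, start=1):
        groups.setdefault(level, []).append(number)
    return [groups[level] for level in sorted(groups)]


print(parallel_groups(program))
```

The groups pair original instructions 1 and 4, then 2 and 5, then 3 and 6, which is the same interleaving as the reordered listing in the example.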

16 Simultaneous Multi-threading...
[Figure: issue slots per cycle across 8 units (M, M, FX, FX, FP, FP, BR, CC), shown for one thread and for two threads.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Now this is some real serious stuff, but NO, this is not yet the kick-a** stuff.
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, and Simultaneous Multithreading; legend: Thread 1 to Thread 5, idle slot.]

And NOW we are talking real stuff... Multiprocessing
[Figure: issue slots over time (processor cycles) for multiprocessing (Processor 1 and Processor 2, each running its own thread/code) and for SMT (simultaneous multithreading on 1 processor); legend: threads and idle slot.]

Multi-Core
[Figure: multiprocessing and SMT mapped onto Core 1 and Core 2.]
Dual-core Intel Xeon processors:
  Each core is hyper-threaded.
  Private L1 caches.
  Shared L2 cache.
[Figure: Intel Xeon dual-core. Core 0 and Core 1 each have hyper-threads and an L1 cache; both share an L2 cache in front of memory.]

17 Next: Did We Forget a Key Part of a Computer System?
Key component in a computer? Memory.
How are real memory systems organized?
How do they affect performance?
