Pipeline and Vector Processing

Marianna Strickland
5 years ago

1 Pipeline and Vector Processing (Chapter 9 of Mano's book)

2 Parallel Processing
Throughput: the amount of processing that can be accomplished during a given interval of time

4 Flynn's classification of computers:
  SISD: Single Instruction stream, Single Data stream
  SIMD: Single Instruction stream, Multiple Data stream
  MISD: Multiple Instruction stream, Single Data stream
  MIMD: Multiple Instruction stream, Multiple Data stream

5 Flynn's classification is based on the distinction between the performance of the control unit and that of the data-processing unit. It does not cover every existing type of computer; pipeline processing, for example, does not fit into it.
In this chapter:

6 Laundry Example (by David Patterson)
Four loads of clothes: A, B, C, D
Task: each load must be washed, dried, and folded
Resources:
  Washer takes 30 minutes
  Dryer takes 40 minutes
  Folder takes 20 minutes

7 Sequential Laundry
[Figure: task order A, B, C, D plotted against time, 6 PM to midnight]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

8 Pipelined Laundry
[Figure: task order A, B, C, D plotted against time, 6 PM to midnight]
Pipelined laundry takes 3.5 hours for 4 loads.
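The two laundry totals can be checked with a short simulation. This is a minimal sketch, assuming the stage times from the laundry example (washer 30, dryer 40, folder 20 minutes) and assuming a finished load may wait between machines; the function names are hypothetical.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage starts a load as soon as the load and the machine are free."""
    # free_at[s] is the time at which stage s finishes its previous load.
    free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, free_at[s])
            free_at[s] = start + duration
            ready = free_at[s]
    return free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is set by the slowest stage: the dryer is busy for 4 x 40 minutes, plus one wash before it and one fold after it.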

9 Example:

10 How instructions are executed in a pipeline:

11 Speedup:
$S = \dfrac{n \, t_n}{(k + n - 1) \, t_p}$
If $n \gg k$, then $k + n - 1 \approx n$, so $S = t_n / t_p$. If, in addition, $t_n = k \, t_p$, then $S = k$.
  $n$: number of instructions
  $t_n$: time for an instruction to execute in the non-pipelined processor
  $k$: number of segments
  $t_p$: clock cycle time of a pipeline stage
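The limiting behaviour of the speedup can be seen numerically. The sketch below assumes the usual textbook ratio S = n·t_n / ((k + n − 1)·t_p), which is where the approximation k + n − 1 ≈ n comes from; the sample values (k = 4, t_p = 20, t_n = 80) are illustrative assumptions chosen so that t_n = k·t_p.

```python
def pipeline_speedup(n, k, t_n, t_p):
    """Speedup of a k-segment pipeline over a non-pipelined unit for n instructions."""
    non_pipelined_time = n * t_n
    pipelined_time = (k + n - 1) * t_p  # k cycles to fill, then one result per cycle
    return non_pipelined_time / pipelined_time


# Illustrative values only: 4 segments, t_p = 20 time units, t_n = k * t_p = 80.
for n in (1, 10, 100, 10_000):
    print(n, round(pipeline_speedup(n, k=4, t_n=80, t_p=20), 3))
```

As n grows, the printed speedup approaches k = 4 but never reaches it, because the first k − 1 cycles are spent filling the pipeline.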

12 In systems with multiple functional units, we can take advantage of pipelining by replicating the arithmetic units used by the instruction being executed.

13 What can be done in SIMD systems: in the figure, one arithmetic instruction is performed on four different data items at the same time. It is like a pipelined system with four pipelines.

14 Areas of pipelining:
  Arithmetic pipeline
  Instruction pipeline
Arithmetic pipeline:
  Floating-point computations (chap. 10)
  Fixed-point multiplication (chap. 10)
  Similar computations

15 Example: Floating-point Add/Sub

18 Instruction pipeline
Simple example: two-segment pipeline
[Figure: memory, fetch instruction, FIFO buffer, execute instruction]
Reduction in the access time to memory
Instruction pipeline:

19 Instruction pipeline problems:
  1. Different segments may take different times.
  2. Some segments are skipped for certain operations (for example, a register-mode instruction).
  3. Two or more segments may require memory access at the same time (addressed by separate memory modules for data and instructions).
  4. Direct or conditional jump operations require some instructions to be skipped.

20 Example: four-segment instruction pipeline
Another memory module
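A space-time table makes the overlap in a four-segment instruction pipeline visible. This is a minimal sketch; the segment names FI, DA, FO, EX (fetch instruction, decode and calculate address, fetch operand, execute) are an assumption, since the slide's figure is not available, and the function name is hypothetical.

```python
def space_time(num_instructions, stages=("FI", "DA", "FO", "EX")):
    """Return one row per instruction; column c holds the stage active in cycle c + 1."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows


for number, row in enumerate(space_time(4), start=1):
    print(f"I{number}: " + " ".join(row))
```

With no branches or conflicts, four instructions finish in 4 + 4 − 1 = 7 cycles, and from cycle 4 onward one instruction completes per cycle.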

22 Timing of the instruction pipeline with a branch instruction

23 Pipeline conflicts:

24 2. Data dependency:
Data dependency:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
Address dependency:
  mov r1,[r2]
  sub r4,[r1],r3
  and r6,[r1],r7

25 Resolving data dependency
  Hardware interlock: a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline, and inserts the required delays.
  Operand forwarding: special hardware that detects a conflict and avoids it by routing the data through special paths between pipeline segments (for example, by adding a path from the ALU output to the destination).
  Delayed load: a compiler method; the compiler reorders the instructions as necessary to delay the load.
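The check that a hardware interlock performs can be sketched in software. The example below is an illustration only, not a description of any real circuit: it scans the data-dependency sequence (add, sub, and) and reports every instruction whose source register is the destination of an instruction still within a small window ahead of it. The tuple encoding, the `window` parameter, and the function name are hypothetical.

```python
# Each instruction: (opcode, destination register, source registers).
program = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
]


def raw_hazards(instructions, window=2):
    """List (producer index, consumer index, register) read-after-write conflicts.

    `window` is the assumed number of earlier instructions whose results
    may not have been written back yet.
    """
    hazards = []
    for j, (_, _, sources) in enumerate(instructions):
        for i in range(max(0, j - window), j):
            destination = instructions[i][1]
            if destination in sources:
                hazards.append((i, j, destination))
    return hazards


print(raw_hazards(program))  # [(0, 1, 'r1'), (0, 2, 'r1')]
```

Both `sub` and `and` read r1 before `add` has necessarily written it, so an interlock would delay them, while operand forwarding would route the ALU result to them directly.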

26 3. Branch difficulties
  Pre-fetch target instruction: pre-fetch the target instruction in addition to the instruction following the branch.
  Branch Target Buffer (BTB): stores previously executed branch instructions.
  Loop buffer: an extension of the BTB; a set of fast registers that stores all the instructions of a loop.
  Branch prediction: uses an additional logic circuit to guess the outcome of a conditional branch.
  Delayed branch: the compiler rearranges the instructions so that useful instructions run during the branch execution cycles; used in RISC processors.

27 Example: RISC pipeline
  One clock cycle per instruction executed.
  Fixed-length instruction format.
  Register-to-register operations.
  Two memory modules: instruction memory and data memory.
  The compiler is used to optimize the pipeline.

28 Types of instructions:
  Data manipulation (operates on registers)
  Data transfer (only load and store)
  Program control
Three-segment instruction pipeline:
  Data manipulation
  Evaluating the effective address
  Calculating the branch address

29 Example of delayed load in RISC

31 Example of delayed branch in RISC

35 Vector Processing
For problems that need a vast number of computations

37 Interleaved memory: memory modules with a common bus
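One common way to spread addresses over interleaved modules is to let the low-order address bits select the module. The sketch below assumes that scheme and four modules; neither detail is stated on the slide, and the function name is hypothetical.

```python
def interleave(address, num_modules=4):
    """Map a memory address to (module number, word offset inside that module).

    Assumes low-order interleaving: consecutive addresses fall in
    consecutive modules, so they can be accessed in overlapped fashion.
    """
    return address % num_modules, address // num_modules


for address in range(8):
    module, offset = interleave(address)
    print(f"address {address} -> module {module}, offset {offset}")
```

Under this assumption a vector stored at consecutive addresses touches each module only once every four accesses, which is what lets the modules overlap their access times.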

38 Supercomputers
Array processor: a processor that performs computations on large arrays of data.
  Attached array processor: an auxiliary processor attached to a general-purpose computer (improves numerical computations)
  SIMD array processor

39 VAX 11 computer and FPS-164/MAX (Floating Point Systems)

40 [Figure: ALU, floating-point unit, working registers; an enable for each PE]

41 1-8. Measuring, Reporting, Summarizing Performance (by D. Patterson)
When we say one computer is faster than another, what do we mean?

42 Some definitions:
The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" will mean:
$\dfrac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, and operating system overhead: everything. With multiprogramming, however, elapsed time is not the time the processor spends on that task alone. CPU time recognizes this distinction and means the time the processor is computing, not including the time waiting for I/O or running other programs.

43 Wall-clock time, response time, elapsed time
  the latency to complete a task, including disk accesses, memory accesses, input/output activities, operating system overhead, ...
CPU time
  the time the CPU is computing, excluding I/O or running other programs with multiprogramming
  often further divided into user and system CPU times
User CPU time
  the CPU time spent in the program
System CPU time
  the CPU time spent in the operating system
With multiprogramming, the response time seen by the user is the elapsed time of the program, not the CPU time.

44 UNIX time command
90.7u 12.9s 2:39 65%
  90.7u: 90.7 seconds of user CPU time
  12.9s: 12.9 seconds of system CPU time
  2:39: elapsed time (159 seconds)
  65%: percentage of elapsed time that is CPU time, (90.7 + 12.9)/159
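The percentage reported by the time command can be reproduced from the other three fields. A minimal sketch, using the sample output 90.7u 12.9s 2:39; the function name and the "minutes:seconds" parsing are assumptions made for illustration.

```python
def cpu_percentage(user_seconds, system_seconds, elapsed):
    """Share of elapsed time spent as CPU time; `elapsed` is 'minutes:seconds'."""
    minutes, seconds = elapsed.split(":")
    elapsed_seconds = int(minutes) * 60 + float(seconds)
    return 100 * (user_seconds + system_seconds) / elapsed_seconds


print(round(cpu_percentage(90.7, 12.9, "2:39")))  # 65
```

The remaining 35% of the elapsed time was spent waiting for I/O or running other programs.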

45 CPU Execution Time
$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$
$\text{CPU time} = \dfrac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$
Instruction count (IC) = number of instructions executed
Clock cycles per instruction (CPI):
$\text{CPI} = \dfrac{\text{CPU clock cycles for a program}}{\text{IC}}$
CPI is one way to compare two machines with the same instruction set, since the instruction count would be the same.

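46 CPU Execution Time (cont'd)
$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$
$\text{CPU time} = \dfrac{\text{IC} \times \text{CPI}}{\text{Clock rate}}$
$\text{CPU time} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Clock cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Clock cycle}} = \dfrac{\text{Seconds}}{\text{Program}}$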
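The CPU time relation is easy to exercise in both directions. This is a minimal sketch with made-up numbers (10^9 instructions, CPI of 2.0, a 1 GHz clock), which are assumptions for illustration and not measurements; the function names are hypothetical.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def cpi_from_measurement(execution_time_s, clock_rate_hz, instruction_count):
    """CPI = execution time / clock cycle time / IC, with cycle time = 1 / clock rate."""
    return execution_time_s * clock_rate_hz / instruction_count


seconds = cpu_time(1_000_000_000, 2.0, 1e9)
print(seconds)                                             # 2.0
print(cpi_from_measurement(seconds, 1e9, 1_000_000_000))   # 2.0
```

Halving any one of the three factors (instruction count, CPI, or clock cycle time) halves the CPU time, provided the other two stay unchanged.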

47 How to Calculate the 3 Components?
Clock cycle time
  in the specification of the computer (clock rate in advertisements)
Instruction count
  Count instructions in the loop of a small program
  Use a simulator to count instructions
  Hardware counter in a special register (Pentium II)
CPI
  Calculate: Execution time / Clock cycle time / Instruction count
  Hardware counter in a special register (Pentium II)

48 Another Way to Calculate CPI
First calculate the CPI for each individual instruction (add, sub, and, etc.): CPI_i
Next calculate the frequency of each individual instruction: Freq_i = IC_i / IC
Finally multiply these two for each instruction and add them up to get the final CPI:
$\text{CPI} = \sum_{i=1}^{n} \dfrac{\text{IC}_i}{\text{IC}} \times \text{CPI}_i$

Op       Freq_i   CPI_i   Prod   % Time
ALU      50%      1       0.5    23%
Load     20%      5       1.0    45%
Store    10%      3       0.3    14%
Branch   20%      2       0.4    18%
Total                     2.2
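The weighted sum can be checked in a few lines. The sketch below assumes per-class CPI values of 1, 5, 3, and 2 for ALU, load, store, and branch; these are the values implied by the slide's frequencies, its % Time row, and its total of 2.2, not figures quoted from the slide itself. The function names are hypothetical.

```python
# Instruction mix: class -> (frequency, CPI for that class).
# The CPI values are assumptions consistent with the 2.2 total.
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def weighted_cpi(instruction_mix):
    """Overall CPI = sum over classes of frequency x class CPI."""
    return sum(freq * cpi for freq, cpi in instruction_mix.values())


def time_share_percent(instruction_mix):
    """Percentage of execution time spent in each instruction class."""
    total = weighted_cpi(instruction_mix)
    return {op: round(100 * freq * cpi / total)
            for op, (freq, cpi) in instruction_mix.items()}


print(round(weighted_cpi(mix), 2))  # 2.2
print(time_share_percent(mix))      # {'ALU': 23, 'Load': 45, 'Store': 14, 'Branch': 18}
```

Loads are only 20% of the instructions but take about 45% of the time, which is why the time share, not the frequency, shows where an improvement pays off.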

49 Choosing Programs to Evaluate Performance
Ideally, run typical programs with typical input before purchase, or before even building the machine
  An engineer uses a compiler; an author uses a word processor, drawing program, compression software
Workload: the mixture of programs and OS commands that users run on a machine
Few can do this
  Don't have access to the machine to benchmark before purchase
  Don't know the future workload

50 Benchmarks
Different types of benchmarks:
  Real programs (e.g., MSWord, Excel, Photoshop, ...)
  Kernels: small pieces from real programs (Linpack, ...)
  Toy benchmarks: short, easy to type and run (Quicksort, Puzzle, ...)
  Synthetic benchmarks: code that matches the frequency of key instructions and operations to real programs (Whetstone, Dhrystone)
Industry standards are needed so that different processors can be fairly compared
Companies exist that create these benchmarks: typical code used to evaluate systems

51 Benchmark Suites
SPEC (Standard Performance Evaluation Corporation)
  originally focusing on CPU performance: SPEC CPU2000
  graphics benchmarks: SPECviewperf, SPECapc
  server benchmarks: SPECSFS, SPECWEB
PC benchmarks (Winbench 99, Business Winstone 99, High-end Winstone 99, CC Winstone 99)
Transaction processing benchmarks
Embedded benchmarks

52 Comparing and Summarising Performance: An Example
Table columns: Program, Computer A, Computer B, Computer C; rows: P1 (sec), P2 (sec), Total (sec)
  A is 20 times faster than C for program P1
  C is 50 times faster than A for program P2
  B is 2 times faster than C for program P1
  C is 5 times faster than B for program P2
What can we learn from these statements? We know nothing about the relative performance of computers A, B, C!
One approach to summarising relative performance: use the total execution times of the programs
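The gap between per-program ratios and an overall summary can be illustrated with numbers. The execution times below are HYPOTHETICAL: the slide's table values are not available, so these were chosen only because they satisfy the four "times faster" statements. The function names are also assumptions.

```python
# HYPOTHETICAL execution times in seconds, chosen to match the four ratios.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}


def times_faster(x, y, program):
    """How many times faster computer x is than computer y on one program."""
    return times[y][program] / times[x][program]


def total_time(computer):
    """Total execution time of the whole workload (P1 + P2) on one computer."""
    return sum(times[computer].values())


print(times_faster("A", "C", "P1"))       # 20.0
print(times_faster("C", "A", "P2"))       # 50.0
print(times_faster("B", "C", "P1"))       # 2.0
print(times_faster("C", "B", "P2"))       # 5.0
print({c: total_time(c) for c in times})  # {'A': 1001, 'B': 110, 'C': 40}
```

With these assumed times, the individual ratios point in opposite directions, while the totals give one consistent ordering for this particular workload.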

53 Amdahl's Law
Suppose that we make an enhancement to a machine that will improve its performance. Speedup is the ratio:
$\text{Speedup} = \dfrac{\text{ExTime for entire task without enhancement}}{\text{ExTime for entire task using enhancement}}$
$\text{Speedup} = \dfrac{\text{Performance for entire task using enhancement}}{\text{Performance for entire task without enhancement}}$
Amdahl's Law states that the performance improvement that can be gained by a particular enhancement is limited by the amount of time that enhancement can be used.

54 Amdahl's Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:
  1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always less than or equal to 1.
  2. The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion of the program, while it is 5 seconds in the original mode, the improvement is 5/2. We will call this value, which is always greater than 1, Speedup_enhanced.

55 Computing Speedup
Fraction_enhanced = fraction of execution time in the original machine that can be converted to take advantage of the enhancement (e.g., 10/30)
Speedup_enhanced = how much faster the enhanced code will run (e.g., 10/2 = 5)
The execution time of the enhanced program is the sum of the old execution time of the unenhanced part of the program and the new execution time of the enhanced part of the program:
$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{unenhanced}} + \dfrac{\text{ExTime}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}$
(e.g., the enhanced part takes 10/5 = 2)

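56 The enhanced part of the program is Fraction_enhanced, so the times are:
$\text{ExTime}_{\text{unenhanced}} = \text{ExTime}_{\text{old}} \times (1 - \text{Fraction}_{\text{enhanced}})$
$\text{ExTime}_{\text{enhanced}} = \text{ExTime}_{\text{old}} \times \text{Fraction}_{\text{enhanced}}$
Factor out ExTime_old and divide by Speedup_enhanced:
$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$
Overall speedup is the ratio of ExTime_old to ExTime_new:
$\text{Speedup}_{\text{overall}} = \dfrac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$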

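57 An Example
Enhancement runs 10 times faster and it affects 40% of the execution time.
  Fraction_enhanced = 0.40
  Speedup_enhanced = 10
  Speedup_overall = ?
$\text{Speedup}_{\text{overall}} = \dfrac{1}{(1 - 0.40) + \dfrac{0.40}{10}} = \dfrac{1}{0.64} \approx 1.56$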
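Amdahl's Law is short enough to evaluate directly. A minimal sketch, assuming the standard form Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced); the function name is hypothetical.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is enhanced."""
    unenhanced_share = 1.0 - fraction_enhanced
    enhanced_share = fraction_enhanced / speedup_enhanced
    return 1.0 / (unenhanced_share + enhanced_share)


# 40% of the execution time runs 10 times faster.
print(round(amdahl_speedup(0.40, 10), 4))  # 1.5625

# Even an enormous enhancement cannot beat 1 / (1 - 0.40).
print(round(amdahl_speedup(0.40, 1e9), 4))  # 1.6667
```

The second call shows the limit stated by the law: with 60% of the time untouched, the overall speedup can never exceed 1/0.6, however fast the enhanced part becomes.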

PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates