Computer Organization and Components (IS1500), fall 2015

Lecture: Parallelism, Concurrency, and ILP
Associate Professor, KTH Royal Institute of Technology
Assistant Research Engineer, University of California, Berkeley
Slides version 1.0

Course structure:
- Module 1: C and Assembly Programming
- Module 2: Logic Design
- Module 3: Processor Design
- Module 4: I/O Systems
- Module 5: Memory Hierarchy
- Module 6: Parallel Processors and Programs (this lecture)
Each module combines lectures with labs, and the course ends with a project expo.

Abstractions in Computer Systems:
- Networked Systems and Systems of Systems
- Application Software and Operating System (software)
- Instruction Set Architecture (hardware/software interface)
- Microarchitecture, Logic and Building Blocks, Digital Circuits (digital hardware design)
- Analog Circuits, Devices and Physics (analog design and physics)
How is this computer revolution possible? (Revisited)

Moore's law: integrated circuit resources (transistors) double every 18-24 months.
- Formulated by Gordon E. Moore, Intel's co-founder, in the 1960s.
- Possible because of refined manufacturing processes; e.g., 4th generation Intel Core i7 processors use a 22 nm manufacturing process.
- Sometimes considered a self-fulfilling prophecy, since it has served as a goal for the semiconductor industry.

Acknowledgement: The structure and several of the good examples are derived from the book Computer Organization and Design (2014) by David A. Patterson and John L. Hennessy.

Have we reached the limit? (Revisited) Why? The Power Wall

Over the last decades, the clock rate increased dramatically:
- 1989: Intel 486, 25 MHz
- 1993: Pentium, 66 MHz
- 1997: Pentium Pro, 200 MHz
- 2001: Pentium 4, 2.0 GHz
- 2004: Pentium 4, 3.6 GHz
- 2014: Core i7, 3.4 GHz - 4 GHz

Increased clock rate implies increased power, and we cannot cool the system enough to increase the clock rate any further (a worked example of the power relation follows below). New trend since 2006:
- Moore's law still holds.
- More processors on a chip: multicore.
- New challenge: parallel programming.

(Image: brick wall, http://www.publicdomainpictures.net/view-image.php?image=8&picture=tegelvagg)

What is a multiprocessor?
- A multiprocessor is a computer system with two or more processors. By contrast, a computer with one processor is called a uniprocessor.
- Multicore microprocessors are multiprocessors where all processors (cores) are located on a single integrated circuit.
- A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor. (Photo by Robert Harker)

(Photo by Eric Gaba, CC BY-SA 3.0. No modifications made.)
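Back to the power wall: it can be made concrete with the dynamic power relation for CMOS logic used in Patterson and Hennessy (dynamic power grows with capacitive load, the square of the voltage, and the switching frequency). A small worked example; the 15% scaling numbers are illustrative and not from the slides:

    % Dynamic power of CMOS logic:
    %   P_dyn ~ capacitive load x voltage^2 x switching frequency
    P_{\mathrm{dyn}} \propto C \cdot V^2 \cdot f

    % Example: scale both supply voltage and frequency down by 15%:
    \frac{P_{\mathrm{new}}}{P_{\mathrm{old}}}
      = \frac{C \cdot (0.85\,V)^2 \cdot (0.85\,f)}{C \cdot V^2 \cdot f}
      = 0.85^3 \approx 0.61

Lowering the voltage along with the frequency cuts power by roughly 39% in this example. The wall appears because the voltage cannot be lowered much further, so raising the clock rate directly raises power beyond what we can cool.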
Different Kinds of Computer Systems (Revisited)
- Embedded Real-Time Systems (photo by Kyro)
- Personal Computers and Personal Mobile Devices
- Warehouse Scale Computers (photo by Robert Harker)
Each kind emphasizes a different mix of performance, dependability, and energy.

Why multiprocessors?
- Performance: possible to execute many computation tasks in parallel.
- Dependability: if one out of N processors fails, there are still N-1 functioning processors.
- Energy: replace energy-inefficient processors in data centers with many smaller, more efficient processors.

Parallelism and Concurrency - what is the difference?
- Concurrency is about handling many things at the same time. Concurrency may be viewed from the software viewpoint.
- Parallelism is about doing (executing) many things at the same time. Parallelism may be viewed from the hardware viewpoint.

                        Sequential software          Concurrent software
    Serial hardware     Matrix multiplication on     A Linux OS running on
                        a unicore processor          a unicore processor
    Parallel hardware   Matrix multiplication on     A Linux OS running on
                        a multicore processor        a multicore processor

Note: As always, not everybody agrees on the definitions of concurrency and parallelism. The matrix is from H&P 2012, and the informal definitions above are similar to what was said in a talk by Rob Pike.

How much can we improve the performance using parallelization?

    Speedup = T_before / T_after

where T_before is the execution time of the program before the improvement and T_after is the execution time after the improvement.
- Linear speedup (or ideal speedup): the speedup equals the number of processors.
- Sublinear speedup: still increased speedup, but less efficient.
- Superlinear speedup: either wrong, or due to e.g. cache effects.
- Danger: relative speedup measures only the same program. True speedup also compares with the best known sequential program. (A sketch of measuring speedup on a multicore follows below.)
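As a concrete instance of the "sequential-looking software on parallel hardware" cell, here is a minimal sketch (mine, not from the slides) of measuring relative speedup for matrix multiplication in C with OpenMP; the matrix size and initialization values are arbitrary:

    #include <stdio.h>
    #include <omp.h>

    #define N 512
    static double a[N][N], b[N][N], c[N][N];

    int main(void) {
        /* Arbitrary input data. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }

        double start = omp_get_wtime();

        /* The iterations of the outer loop are independent, so OpenMP
           can hand one chunk of rows to each core. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }

        printf("T_after = %.3f s on %d threads\n",
               omp_get_wtime() - start, omp_get_max_threads());
        return 0;
    }

Compiling with gcc -O2 -fopenmp and running once with OMP_NUM_THREADS=1 (giving T_before) and once with all cores (giving T_after) yields exactly the relative speedup defined above - which is why the "danger" note matters: the one-thread run is not necessarily the best sequential program.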
Amdahl's Law (1/4)

Can we achieve linear speedup? Divide the execution time before the improvement into two parts:

    T_before = T_affected + T_unaffected

where T_affected is the time affected by the improvement (the parallelizable part) and T_unaffected is the time unaffected by the improvement (the sequential part). If the improvement is N times (N times improvement, e.g., N processors), the execution time after the improvement is

    T_after = T_affected / N + T_unaffected

This is sometimes referred to as Amdahl's law. The speedup is then

    Speedup = T_before / T_after = (T_affected + T_unaffected) / (T_affected / N + T_unaffected)

Amdahl's Law (2/4)

Exercise: Assume a program consists of an image analysis task, sequentially followed by a statistical computation task. Only the image analysis task can be parallelized. How much do we need to improve the image analysis task to achieve 4 times speedup? Assume that the program takes 80 ms in total and that the image analysis task takes 60 ms out of this time.

Solution:
    4 = 80 / (60/N + 80 - 60)
    60/N + 20 = 20
    60/N = 0
It is impossible to achieve this speedup!

Amdahl's Law (3/4)

Assume that we perform 10 scalar integer additions, followed by one matrix addition, where the matrices are 10x10. Assume that all additions take the same amount of time and that we can only parallelize the matrix addition.
- Exercise A: What is the speedup with 10 processors?
- Exercise B: What is the speedup with 40 processors?
- Exercise C: What is the maximal speedup (the limit when N -> infinity)?

Solutions:
- A: (10 + 10*10) / (10*10/10 + 10) = 110/20 = 5.5
- B: (10 + 10*10) / (10*10/40 + 10) = 110/12.5 = 8.8
- C: (10 + 10*10) / ((10*10)/N + 10) -> 110/10 = 11 when N -> infinity

Amdahl's Law (4/4)

Example continued. What if we change the size of the problem (make the matrices larger)?

    Size of matrices    10 processors    40 processors
    10x10               5.5              8.8
    20x20               8.2              20.5

But was not 11 the maximal speedup when N -> infinity? Only for the fixed 10x10 problem size:
- Strong scaling = measuring speedup while keeping the problem size fixed.
- Weak scaling = measuring speedup when the problem size grows proportionally to the increased number of processors.
(The small calculator below reproduces these numbers.)
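A small C helper (mine, not from the slides) that implements the speedup formula and reproduces the numbers in exercises A-C and the scaling table:

    #include <stdio.h>

    /* Amdahl's law: T_after = T_affected/n + T_unaffected. */
    static double speedup(double t_aff, double t_unaff, double n) {
        return (t_aff + t_unaff) / (t_aff / n + t_unaff);
    }

    int main(void) {
        /* 10 scalar additions (sequential) + one 10x10 matrix
           addition (100 parallelizable element additions). */
        printf("A (10 procs):    %.1f\n", speedup(100.0, 10.0, 10.0));   /* 5.5  */
        printf("B (40 procs):    %.1f\n", speedup(100.0, 10.0, 40.0));   /* 8.8  */
        printf("C (N -> inf):    %.1f\n", speedup(100.0, 10.0, 1e12));   /* 11.0 */

        /* Weak scaling: grow the matrices to 20x20 (400 additions). */
        printf("20x20, 10 procs: %.1f\n", speedup(400.0, 10.0, 10.0));   /* 8.2  */
        printf("20x20, 40 procs: %.1f\n", speedup(400.0, 10.0, 40.0));   /* 20.5 */
        return 0;
    }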
Main Classes of Parallelism

Task-Level Parallelism (TLP)
- Different tasks of work that can be carried out independently and in parallel.
- Example: many tasks at the farm. Assume that there are many different things that can be done on the farm (fix the barn, sheep shearing, feed the pigs, etc.). Task-level parallelism would be to let the farm hands do the different tasks in parallel.

Data-Level Parallelism (DLP)
- Many data items can be processed at the same time.
- Example: sheep shearing. Assume that the sheep are data items and the task for the farmer is sheep shearing (removing the wool). Data-level parallelism would be to use several farm hands to do the shearing.

SISD, SIMD, and MIMD

An old (from the 1960s) but still very useful classification of processors (Flynn's taxonomy) uses the notion of instruction and data streams:

                                 Data stream
                                 Single                 Multiple
    Instruction     Single       SISD                   SIMD
    stream                       (e.g., Intel           (e.g., SSE instructions
                                 Pentium 4)             in x86)
                    Multiple     MISD                   MIMD
                                 (no examples today)    (e.g., Intel Core i7)

- SIMD exploits data-level parallelism. Examples are multimedia extensions (e.g., SSE, streaming SIMD extensions) and vector processors. (A small SSE sketch follows below.)
- MIMD exploits task-level parallelism. Examples are multicore and cluster computers.
- Graphics Processing Units (GPUs) are both SIMD and MIMD.

Physical Q/A: What is a modern Intel CPU, such as Core i7? Stand up for MIMD, stand on the table for SIMD.

What is Instruction-Level Parallelism?

Instruction-level parallelism (ILP) may increase performance without involvement of the programmer. It may be implemented in a SISD, SIMD, or MIMD computer. There are two main approaches:
1. Deep pipelines, with more pipeline stages. If the lengths of all pipeline stages are balanced, we may increase performance by increasing the clock speed.
2. Multiple issue, a technique where multiple instructions are issued in each clock cycle. ILP may decrease the CPI to lower than 1 - or, using the inverse metric instructions per clock cycle (IPC), increase it above 1.

Acknowledgement: The structure and several of the good examples are derived from the book Computer Organization and Design (2014) by David A. Patterson and John L. Hennessy.
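To make the SIMD column concrete: SSE provides 128-bit registers that hold four 32-bit floats, so one instruction performs four additions. A minimal sketch in C using the SSE intrinsics from <xmmintrin.h> (my illustration, not from the slides):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float r[4];

        __m128 va = _mm_loadu_ps(a);     /* load four floats            */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(va, vb);  /* ONE instruction, FOUR sums:
                                            single instruction stream,
                                            multiple data streams       */
        _mm_storeu_ps(r, vr);

        printf("%.0f %.0f %.0f %.0f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }

Running it prints 11 22 33 44: a single addition instruction processed all four data items, which is exactly the SIMD cell of the table above.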
Multiple Issue Approaches

Problems with multiple issue:
1. Determining how many, and which, instructions to issue in each cycle. This is problematic because instructions typically depend on each other.
2. How to deal with data and control hazards.

The two main approaches of multiple issue are:
1. Static multiple issue: decisions on when and which instructions to issue at each clock cycle are determined by the compiler.
2. Dynamic multiple issue: many of the decisions on issuing instructions are made by the processor, dynamically, during execution.

Static Multiple Issue (1/3) - VLIW

Very Long Instruction Word (VLIW) processors issue several instructions in each cycle using issue packets. An issue packet may be seen as one large instruction with multiple operations. In this example, each issue packet has two issue slots:

    add $s0, $s1, $s2    F D E M W
    add $t0, $t0, $0     F D E M W
    and $t1, $t2, $s0      F D E M W
    lw  $t0, 0($s0)        F D E M W

The compiler may insert no-ops to avoid hazards. How does VLIW affect the hardware implementation?

Static Multiple Issue (2/3) - Changing the hardware

(Figure: two-issue MIPS datapath with PC, instruction memory, register file, sign extension, ALUs, and data memory.)

To issue two instructions in each cycle, we need the following (not all shown in the picture):
- Fetch and decode 64 bits per cycle (two instructions).
- Add more read and write ports to the register file.
- Add another ALU.
- Double the number of ports for the data memory.

Static Multiple Issue (3/3) - VLIW

Exercise: Assume we have a 5-stage VLIW MIPS processor that can issue two instructions in the same clock cycle. For the following code, schedule the instructions into issue slots and compute the IPC.

        addi $s2, $0, 10
    L1: lw   $t0, 0($s1)
        ori  $t1, $t0, 7
        sw   $t1, 0($s1)
        addi $s1, $s1, 4
        addi $s2, $s2, -1
        bne  $s2, $0, L1

Solution (empty cells mean no-ops; the instructions are rescheduled to avoid stalls because of the lw):

    Slot 1                    Slot 2                Cycle
    addi $s2, $0, 10                                1
    L1: lw $t0, 0($s1)        addi $s1, $s1, 4      2
    addi $s2, $s2, -1                               3
    ori  $t1, $t0, 7                                4
    bne  $s2, $0, L1          sw $t1, -4($s1)       5

    IPC = (1 + 6*10) / (1 + 4*10) = 61/41 ~ 1.49   (max 2)

(A C rendering of this loop follows below.)
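For reference, my reading of what the loop above computes, rendered as C (the array length 10 comes from the $s2 counter; this is an illustration, not from the slides):

    /* The lw/ori/sw triple ORs each word with 7 and stores it back;
       the addi instructions advance the pointer and decrement the
       counter, and bne closes the loop. */
    void or_loop(unsigned int a[10]) {
        for (int i = 0; i < 10; i++)
            a[i] |= 7;
    }

Because each iteration's load and store touch a different word, the compiler is free to move the address and counter updates in between them - which is exactly the static rescheduling shown in the solution.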
Dynamic Multiple Issue Processors (1/2) - Superscalar Processors

In many modern processors (e.g., Intel Core i7), instruction issuing is performed dynamically by the processor while executing the program. Such a processor is called superscalar.

(Figure: an instruction fetch and decode unit feeding reservation stations (RS), which feed functional units (FU), which feed a commit unit.)

- The first unit fetches and decodes several instructions in order.
- Reservation stations (RS) buffer operands for the functional units before execution. Data dependencies are handled dynamically.
- Functional units (FU) execute the instructions. Examples are integer units, floating-point units, and load/store units.
- Results are sent back to the reservation stations (if a unit waits on the operand) and to the commit unit.
- The commit unit commits instructions (stores results in registers) conservatively, respecting the observable order of instructions. This is called in-order commit.

If the superscalar processor can reorder the instruction execution order, it is an out-of-order execution processor.

Dynamic Multiple Issue Processors (2/2) - Out-of-Order Execution, RAW, WAR

    lw   $t0, 0($s1)
    addi $t1, $t0, 7
    sub  $t0, $t1, $s0
    addi $t1, $s0, 20

- Example of a read-after-write (RAW) dependency (on $t0, between lw and the first addi): the superscalar pipeline must make sure that the data is available before it is read.
- Example of a write-after-read (WAR) dependency (on $t1, between sub and the last addi): if the order is flipped due to out-of-order execution, we have a hazard.
- WAR dependencies can be resolved using register renaming, where the processor writes to a non-architectural renaming register (in the pseudo assembly below called $r1, not accessible to the programmer), so the two instructions become independent:

    addi $r1, $s0, 20
    sub  $t0, $t1, $s0

Note that an out-of-order processor does not rewrite the code; it keeps track of the renamed registers during execution. (A source-level analogy follows below.)

Some observations on multi-issue processors
- VLIW processors tend to be more energy efficient than superscalar out-of-order processors (less hardware; the compiler does the job).
- Superscalar processors with dynamic scheduling can hide some latencies that are not statically predictable (e.g., cache misses, dynamic branch predictions).
- Although modern processors issue 4 to 6 instructions per clock cycle, few applications achieve an IPC over 2. The reason is dependencies.

Intel Microprocessors, some examples

    Processor                   Year   Clock Rate   Pipeline   Issue   Cores   Power
                                                    Stages     Width
    Intel 486                   1989     25 MHz        5         1       1       5 W
    Intel Pentium               1993     66 MHz        5         2       1      10 W
    Intel Pentium Pro           1997    200 MHz       10         3       1      29 W
    Intel Pentium 4 Willamette  2001   2000 MHz       22         3       1      75 W
    Intel Pentium 4 Prescott    2004   3600 MHz       31         3       1     103 W
    Intel Core                  2006   2930 MHz       14         4       2      75 W
    Intel Core i5 Nehalem       2010   3300 MHz       14         4       4      87 W
    Intel Core i5 Ivy Bridge    2012   3400 MHz       14         4       8      77 W

- Clock rate increase stopped (the power wall) around 2006.
- The number of pipeline stages first increased, peaking with the Pentium 4, and then decreased.
- Power peaked and then decreased, while the number of cores increased after 2006.

Source: Patterson and Hennessy, 2014.
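Returning to the renaming example above: a source-level analogy (my example, not from the slides). A true RAW dependency cannot be removed, but a WAR name dependency disappears when the second write gets a fresh name - which is what renaming registers do in hardware:

    #include <stdio.h>

    int main(void) {
        int s0 = 5;

        int t0 = 100;      /* like: lw   $t0, 0($s1)                     */
        int t1 = t0 + 7;   /* RAW: must wait for t0; cannot be removed   */
        t0 = t1 - s0;      /* reads t1 ...                               */
        /* WAR: writing "t1 = s0 + 20;" here could not be moved above
           the read of t1 in the previous statement. With a fresh name,
           the write is independent and may execute in either order: */
        int r1 = s0 + 20;

        printf("%d %d %d\n", t0, t1, r1);
        return 0;
    }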
Bonus Part: Time-Aware Systems Design - Research Challenges
(not part of the examination)

Time-Aware Systems - Examples

Cyber-Physical Systems (CPS):
- Automotive @ KTH
- Process industry and industrial automation
- Aircraft

Time-Aware Simulation Systems:
- Physical simulations (Simulink, Modelica, etc.)

Time-Aware Distributed Systems:
- Time-stamped distributed systems (e.g., Google Spanner)

Programming Model and Time

"Everything should be made as simple as possible, but not simpler" (attributed to Albert Einstein). In the same spirit: execution time should be as short as possible, but not shorter.

Timing is not part of the software semantics: correct execution of programs (e.g., in C, C++, C#, Java, Scala, Haskell, OCaml) has nothing to do with how long things take to execute. Timing instead depends on the hardware platform.

What is our goal?

Traditional approach: minimize the slack between a task's execution time and its deadline.
- Objective: minimize area, memory, and energy. There is no point in making the execution time shorter, as long as the deadline is met.
- Challenge: still guarantee that all timing constraints are met.

Our objective: make time an abstraction within the programming model, so that timing is independent of the hardware platform (within certain constraints).
Are you interested in being challenged?

If you are interested in programming language design, compilers, or computer architecture; if you are ambitious and interested in learning new things; and if you want to do a real research project as part of your Bachelor's or Master's thesis project - please send an email, so that we can discuss some ideas. You know the drill: keep calm.

Reading Guidelines

Module 6: Parallel Processors and Programs

This lecture - Parallelism, Concurrency, and ILP:
- H&H Chapters 1.8, 3.6, 7.8.1-7.8.5
- P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2, or P&H Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2

Next lecture - SIMD, MIMD, and Parallel Programming:
- H&H Chapters 7.8.6-7.8.9
- P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7, or P&H Chapters 2.11, 3.6, 5.8, 7.3-7.7

See the course webpage for more information.

Summary

Some key take-away points:
- Amdahl's law can be used to estimate the maximal speedup when introducing parallelism into parts of a program.
- Instruction-level parallelism (ILP) has been very important for performance improvements over the years, but the improvements have not been as significant lately.

Thanks for listening!