CS533: Concepts of Operating Systems


1 CS533: Concepts of Operating Systems Spring 2018 Prof. Karen Karavanic 2017 Karen L. Karavanic

2 First Week Plan OS Overview Course Organization Overview Introductions How to read a research paper 2017 Karen L. Karavanic

3 Operating Systems: Major Trends PSU CS 533 Prof. Karen L. Karavanic

4 Acknowledgments This presentation includes materials developed by others: Introduction to Computer Systems, 2010, Randy Bryant and Dave O'Hallaron; Kathy Yelick, UC-Berkeley; Andrew S. Tanenbaum, Modern Operating Systems; Remzi and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces CS 533 Karavanic 4

5 Outline The Single Core Era ( -> 2006) 0. The Operating System is Human 1. Batch Processing 2. MultiProgramming and Timesharing 3. Personal Computing 4. Mobile Computing The Multicore Era (2006 ->) 5. Why Multicore? 6. Manycore CS 533 Karavanic 5

6 The First Computer! The Single Core Era: Early Days Charles Babbage (1791-1871) and Ada Lovelace: the Analytical Engine 1821: Difference Engine No. 1: add, subtract, solve polynomial equations Required 25,000 precision-crafted parts Difference Engine No. 2: simpler version Analytical Engine: multiplication, division, algebra 6 CS 533 Karavanic

7 The Single Core Era: Early Days 1990: Science Museum in London builds Difference Engine No. 2 in one year Cost: $500,000 Weight: three tons Size: 11 feet long, 7 feet tall Calculated successive values of seventh-order polynomial equations containing up to 31 digits Proves Babbage's design 7 CS 333 Winter 2015

8 Note: Difference Engine is now at the Computer Museum in San Jose 8 CS 333 Winter 2015

9 The Single Core Era: Early Days First Generation Computing (1945-1955) Goal: compute trajectories for warfare Relays/vacuum tubes (about 20,000) 9 CS 533 Karavanic

10 The Single Core Era: Early Days 10 Programming: plugboards Operating System: none Innovation: punch cards CS 533 Karavanic

11 The ENIAC (1946) 11 CS 533 Karavanic

12 The Single Core Era: Early Days Second Generation Computing (1955-1965) Innovation: transistors, mainframes Innovation: first compiler (Fortran) Batch systems Programming: Fortran, assembly language Innovation: Operating Systems: FMS, IBSYS 12 CS 533 Karavanic

13 The Single Core Era #1: Batch Systems Early OS was human: the batch system Operator brings punch cards to the 1401 Cards include an instruction to stop to load a tape or compiler 1401 reads cards to tape Operator puts the tape on the 7094, which does the computing Operator puts the tape on the 1401, which prints the output Switching from one activity to another, and loading and saving data, were largely manual, unlike today. 13 From A.S. Tanenbaum, Modern Operating Systems

14 The Single Core Era #1: Batch Systems The first modern operating systems (mid-1950s): The user submits a job (written on cards or tape) to a computer operator The computer operator places a batch of several jobs on an input device A special program, the monitor, manages the execution of each program in the batch The resident monitor is in main memory and available for execution Monitor utilities are loaded when needed 14

15 The Monitor Monitor reads jobs one at a time from the input device Monitor places a job in the user program area A monitor instruction branches to the start of the user program Execution of the user program continues until: end-of-program occurs, or an error occurs Either event causes the CPU to fetch its next instruction from the Monitor 15
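To make the control flow concrete, here is a minimal sketch in C (my illustration, not from the slides) of a resident monitor's main loop. The job_t type and the read_next_job, load_into_user_area, and run_user_program names are hypothetical stand-ins for card/tape I/O and the branch into the user program area.

    #include <stdbool.h>
    #include <stdio.h>

    /* Minimal sketch of a resident monitor loop (illustrative only). */

    typedef struct { const char *name; } job_t;   /* hypothetical job descriptor */

    /* Stub "input device": a fixed batch of jobs, read one at a time. */
    static job_t batch[] = { {"job1"}, {"job2"}, {"job3"} };
    static int next_job = 0;

    static bool read_next_job(job_t *job) {
        if (next_job >= (int)(sizeof batch / sizeof batch[0])) return false;
        *job = batch[next_job++];
        return true;
    }

    static void load_into_user_area(const job_t *job) {
        printf("monitor: loading %s into user program area\n", job->name);
    }

    static int run_user_program(void) {
        /* A real monitor would branch to the loaded program; control returns
           to the monitor on end-of-program or error. */
        return 0;
    }

    int main(void) {
        job_t job;
        while (read_next_job(&job)) {            /* read jobs one at a time */
            load_into_user_area(&job);           /* place job in the user program area */
            int status = run_user_program();     /* branch to start of user program */
            if (status != 0)
                printf("monitor: %s ended with error %d\n", job.name, status);
        }
        return 0;
    }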

16 Job Control Language (JCL) Is the language used to provide instructions to the monitor: what compiler to use, what data to use $FTN loads the compiler and transfers control to it $LOAD loads the object code (in place of the compiler) $RUN transfers control to the user program Example of job format:
    $JOB
    $FTN
    ... FORTRAN program ...
    $LOAD
    $RUN
    ... Data ...
    $END
16

17 Job Control Language (JCL) Each read instruction (in the user program) causes one line of input to be read It causes the (OS) input routine to be invoked, which checks that it is not reading a JCL line, and skips to the next JCL line at completion of the user program 17 CS 333 Winter 2015

18 Batch OS Alternates execution between user program and the monitor program Relies on available hardware to effectively alternate execution from various parts of memory 18 CS 333 Winter 2015

19 Batch OS: Desirable Hardware Features Memory protection do not allow the memory area containing the monitor to be altered by user programs Timer prevents a job from monopolizing the system an interrupt occurs when time expires Privileged instructions can be executed only by the monitor an interrupt occurs if a program tries these instructions Interrupts provide flexibility for relinquishing control to and regaining control from user programs 19 CS 333 Winter 2015

20 The Single Core Era: Multiprogramming Major Innovation: Multiprogramming three jobs in memory at once 20 CS 533 Karavanic

21 Multiprogrammed Batch Systems: The Insight I/O operations are exceedingly slow (compared to instruction execution) A program containing even a very small number of I/O ops will spend most of its time waiting for them Hence: poor CPU usage when only one program is present in memory 21 CS 533 Karavanic

22 Multiprogrammed Batch Systems: The Insight If memory can hold several programs, then the CPU can switch to another one whenever a program is waiting for an I/O to complete This is multitasking (multiprogramming) 22 CS 533 Karavanic

23 Requirements for Multiprogramming Hardware support: I/O interrupts and (possibly) DMA in order to execute instructions while an I/O device is busy Memory management several ready-to-run jobs must be kept in memory Memory protection (data and programs) Software support from the OS: Scheduling (which program should run next?) Resource Management (contention, request ordering) 23 CS 533 Karavanic
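As a rough illustration of the insight above (my example, not from the slides), the sketch below models two jobs that alternate short CPU bursts with long I/O waits; with only one job in memory the CPU idles during every I/O wait, while with two jobs in memory the CPU can run the other job during that time. The burst and wait lengths are made-up numbers chosen only to show the effect.

    #include <stdio.h>

    /* Toy model: each job needs BURSTS bursts; each burst is CPU_MS of compute
       followed by IO_MS of I/O wait (illustrative, made-up numbers). */
    enum { BURSTS = 4, CPU_MS = 10, IO_MS = 40 };

    int main(void) {
        /* One job alone: CPU is idle during every I/O wait. */
        int busy_one = BURSTS * CPU_MS;
        int total_one = BURSTS * (CPU_MS + IO_MS);

        /* Two jobs multiprogrammed: while job A waits for I/O, job B computes.
           With IO_MS >= CPU_MS the two jobs' CPU bursts never conflict, so the
           elapsed time grows only by job B's final trailing burst. */
        int busy_two = 2 * BURSTS * CPU_MS;
        int total_two = BURSTS * (CPU_MS + IO_MS) + CPU_MS;

        printf("1 job : CPU utilization = %d%%\n", 100 * busy_one / total_one);
        printf("2 jobs: CPU utilization = %d%%\n", 100 * busy_two / total_two);
        return 0;
    }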

24 The Single Core Era: Multiprogramming Time Sharing Systems (TSS) Batch multiprogramming does not support interaction with users TSS extends multiprogramming to handle multiple interactive jobs The single-core processor's time is shared among multiple users Multiple users simultaneously access the system through terminals Because of slow human reaction time, a typical user needs about 2 sec of processing time per minute, so (about) 60/2 = 30 users should be able to share the same system without noticeable delay in the computer's reaction time NOW: The file system must be protected (multiple users) Pressure to improve user interface, connection speeds CS 333 Winter 2015

25 The Single Core Era: Multiprogramming Time Sharing Systems (TSS) Because of slow human reaction time, a typical user needs about 2 sec of processing time per minute Then (about) 30 users should be able to share the same system without noticeable delay in the computer's reaction time BUT the file system must be protected (multiple users) CS 333 Winter 2015

26 The Single Core Era Fourth Generation Computing (1980s onward) Hardware Innovation: [Very] Large Scale Integration ([V]LSI) Platform Innovation: microcomputers / PCs Innovation: GUI (XEROX), X Innovation: fast networks, internet, WWW Programming: Fortran, assembly language Operating Systems: CP/M, DOS, Windows, Linux CS 333 Winter 2015

27 The Single Core Era Key Hardware Advances Instruction Level Parallelism (ILP) Pipelining Branch Prediction Multiple Instruction Issue The Memory Hierarchy CS 533 Karavanic 27

28 Pipelining The Insight Formulate instruction execution as a sequence of simple steps Use the same general form for all instructions Design hardware so that a different instruction can be at each step concurrently: 1. Instr 1 at stage 1 2. Instr 1 at stage 2, Instr 2 at stage 1 3. Instr 1 at stage 3, Instr 2 at stage 2, Instr 3 at stage 1 28 CS:APP2e

29 Real-World Pipelines: Car Washes (Sequential, Parallel, Pipelined) Idea: Divide the process into independent stages Move each car through the stages in sequence At any given time, multiple cars are being processed 29 CS:APP2e

30 Computational Example A clocked system with one 300 ps block of combinational logic followed by a 20 ps register: Delay = 320 ps, Throughput = 3.12 GIPS Computation requires a total of 300 picoseconds Additional 20 picoseconds to save the result in the register Must have a clock cycle of at least 320 ps 30 CS:APP2e

31 3-Way Pipelined Version Three stages (Comb. logic A, B, C), each 100 ps of logic followed by a 20 ps register: Delay = 360 ps, Throughput = 8.33 GIPS Divide combinational logic into 3 blocks of 100 ps each Can begin a new operation as soon as the previous one passes through stage A: begin a new operation every 120 ps Overall latency increases: 360 ps from start to finish 31 CS:APP2e
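The delay and throughput figures on the last two slides follow directly from the stage timings; the short C check below (my addition, not from the slides) reproduces them. GIPS here means 10^9 instructions per second, so a cycle time in picoseconds converts as 1000/cycle_ps.

    #include <stdio.h>

    /* Reproduce the numbers from the unpipelined and 3-way pipelined examples. */
    int main(void) {
        double logic_ps = 300.0, reg_ps = 20.0;

        /* Unpipelined: one 300 ps block + 20 ps register per instruction. */
        double cycle_unpiped = logic_ps + reg_ps;                 /* 320 ps */
        printf("unpipelined: delay = %.0f ps, throughput = %.2f GIPS\n",
               cycle_unpiped, 1000.0 / cycle_unpiped);

        /* 3-way pipelined: three 100 ps blocks, each followed by a 20 ps register.
           Clock cycle is set by one stage; latency is three full cycles. */
        int stages = 3;
        double cycle_piped = logic_ps / stages + reg_ps;          /* 120 ps */
        printf("pipelined:   delay = %.0f ps, throughput = %.2f GIPS\n",
               stages * cycle_piped, 1000.0 / cycle_piped);
        return 0;
    }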

32 Pipeline Diagrams Unpipelined: OP1, OP2, OP3 run back-to-back over time; cannot start a new operation until the previous one completes 3-Way Pipelined: each of OP1, OP2, OP3 passes through stages A, B, C, with each operation's start staggered by one stage; up to 3 operations in process simultaneously 32 CS:APP2e
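To make the pipelined diagram concrete, here is a small illustrative C program (my addition, not from the slides) that prints which operation occupies stages A, B, and C in each clock cycle for three back-to-back operations; the output reproduces the staggered pattern described above, including the fill and drain cycles.

    #include <stdio.h>

    /* Print a cycle-by-cycle occupancy table for a 3-stage pipeline (A, B, C)
       running 3 operations, matching the "3-Way Pipelined" diagram. */
    int main(void) {
        const int stages = 3, ops = 3;
        printf("cycle   A     B     C\n");
        for (int cycle = 0; cycle < ops + stages - 1; cycle++) {
            printf("%5d", cycle + 1);
            for (int s = 0; s < stages; s++) {
                int op = cycle - s;               /* operation occupying stage s this cycle */
                if (op >= 0 && op < ops)
                    printf("   OP%d", op + 1);
                else
                    printf("   ---");             /* stage is empty (fill/drain) */
            }
            printf("\n");
        }
        return 0;
    }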

33 Limitations: Nonuniform Delays Stages with 50 ps, 150 ps, and 100 ps of logic (each plus a 20 ps register): Delay = 510 ps, Throughput = 5.88 GIPS Throughput is limited by the slowest stage Other stages sit idle for much of the time Challenging to partition the system into balanced stages 33 CS:APP2e

34 Limitations: Register Overhead Six stages, each 50 ps of logic plus a 20 ps register: Delay = 420 ps, Throughput = 14.29 GIPS As we try to deepen the pipeline, the overhead of loading registers becomes more significant Percentage of clock cycle spent loading the register: 1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57% High speeds of modern processor designs are obtained through very deep pipelining 34 CS:APP2e

35 Pipelining Challenges: Data Dependencies A clocked system runs OP1, OP2, OP3 in sequence Each operation depends on the result from the preceding one 35 CS:APP2e

36 Pipelining Challenges: Data Hazards In the 3-stage pipeline (stages A, B, C), OP1-OP4 are staggered by one stage The result does not feed back around in time for the next operation Pipelining has changed the behavior of the system 36 CS:APP2e

37 Data Dependencies in Processors
    irmovl $50, %eax
    addl   %eax, %ebx
    mrmovl 100(%ebx), %edx
Result from one instruction is used as an operand for another: a read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly: get correct results, minimize performance impact 37 CS:APP2e

38 Pipeline Stages Fetch: select current PC, read instruction, compute incremented PC Decode: read program registers Execute: operate ALU Memory: read or write data memory Write Back: update register file 38 CS:APP2e

39 Predicting the PC Start fetch of the new instruction after the current one has completed the fetch stage Not enough time to reliably determine the next instruction Guess which instruction will follow Recover if the prediction was incorrect 39 CS:APP2e

40 Example: Prediction Strategy Instructions that don't transfer control: predict next PC to be valP; always reliable Call and unconditional jumps: predict next PC to be valC (destination); always reliable Conditional jumps: predict next PC to be valC (destination); only correct if the branch is taken, typically right 60% of the time Return instruction: don't try to predict 40 CS:APP2e

41 Pipeline Summary Concept: break instruction execution into 5 stages; run instructions through in pipelined mode Limitations: can't handle dependencies between instructions when instructions follow too closely Data dependencies: one instruction writes a register, a later one reads it Control dependency: instruction sets the PC in a way that the pipeline did not predict correctly (mispredicted branch and return) 41 CS:APP2e

42 Carnegie Mellon Superscalar Processor Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically. Benefit: without programming effort, superscalar processor can take advantage of the instruction level parallelism that most programs have Most CPUs since about 1998 are superscalar. Intel: since Pentium Pro 42

43 Carnegie Mellon Superscalar example: Nehalem CPU Multiple instructions can execute in parallel: 1 load, with address computation 1 store, with address computation 2 simple integer (one may be branch) 1 complex integer (multiply/divide) 1 FP multiply 1 FP add Some instructions take > 1 cycle, but can be pipelined:
    Instruction               | Latency | Cycles/Issue
    Load / Store              | 4       | 1
    Integer Multiply          | 3       | 1
    Integer/Long Divide       |         |
    Single/Double FP Multiply | 4/5     | 1
    Single/Double FP Add      | 3       | 1
    Single/Double FP Divide   |         |
43

44 Carnegie Mellon Loop Unrolling
    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length - 1;
        data_t *d = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            x = (x OP d[i]) OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OP d[i];
        }
        *dest = x;
    }
Perform 2x more useful work per iteration 44

45 Carnegie Mellon Effect of Loop Unrolling Measured cycles per element for Combine, Unroll 2x, and the Latency Bound, across integer Add/Mult and double-FP Add/Mult Unrolling helps integer multiply get below the latency bound: the compiler does a clever optimization The others don't improve. Why? There is still a sequential dependency: x = (x OP d[i]) OP d[i+1] 45
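The slides stop at the sequential dependency; one standard way to get past it (from the same CS:APP material, though not shown on these slides) is to break the dependency chain. The fragments below reuse the slide's data_t, OP, IDENT, d, limit, and x placeholders and would replace the inner loop of unroll2a_combine; they are sketches, not drop-in code, since OP and IDENT are macros defined elsewhere in that benchmark.

    /* Reassociation: combine the two new elements first, then fold into x.
       The OP on d[i] and d[i+1] no longer waits on the previous value of x. */
    for (i = 0; i < limit; i += 2) {
        x = x OP (d[i] OP d[i+1]);
    }

    /* Separate accumulators: two independent dependency chains, combined at the end. */
    data_t x0 = IDENT, x1 = IDENT;
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    x = x0 OP x1;

Either rewrite lets a pipelined functional unit start a new OP before the previous one finishes, which is why unrolling alone is not enough.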

46 Carnegie Mellon What About Branches? Challenge: the Instruction Control Unit must work well ahead of the Execution Unit to generate enough operations to keep the EU busy
    80489f3: movl  $0x1,%ecx
    80489f8: xorl  %edx,%edx
    80489fa: cmpl  %esi,%edx
    80489fc: jnl   8048a25
    80489fe: movl  %esi,%esi
    8048a00: imull (%eax,%edx,4),%ecx
(Executing) How to continue? When the ICU encounters a conditional branch, it cannot reliably determine where to continue fetching 46

47 Carnegie Mellon Branch Outcomes When we encounter a conditional branch, we cannot determine where to continue fetching: Branch Taken: transfer control to the branch target Branch Not-Taken: continue with the next instruction in sequence Cannot resolve until the outcome is determined by the branch/integer unit
    80489f3: movl  $0x1,%ecx
    80489f8: xorl  %edx,%edx
    80489fa: cmpl  %esi,%edx
    80489fc: jnl   8048a25
    80489fe: movl  %esi,%esi
    8048a00: imull (%eax,%edx,4),%ecx
Branch Not-Taken: fall through to 80489fe Branch Taken: continue at
    8048a25: cmpl  %edi,%edx
    8048a27: jl    8048a..
    8048a29: movl  0xc(%ebp),%eax
    8048a2c: leal  0xffffffe8(%ebp),%esp
    8048a2f: movl  %ecx,(%eax)
47

48 Carnegie Mellon Branch Prediction Idea: guess which way the branch will go Begin executing instructions at the predicted position But don't actually modify register or memory data
    80489f3: movl  $0x1,%ecx
    80489f8: xorl  %edx,%edx
    80489fa: cmpl  %esi,%edx
    80489fc: jnl   8048a25
    ...
Predict Taken; Begin Execution at:
    8048a25: cmpl  %edi,%edx
    8048a27: jl    8048a..
    8048a29: movl  0xc(%ebp),%eax
    8048a2c: leal  0xffffffe8(%ebp),%esp
    8048a2f: movl  %ecx,(%eax)
48

49 Branch Prediction Through Loop Assume vector length = 100; the loop branch is predicted taken every time. The first two iterations below are executed; after the misprediction the remaining iterations are wrongly fetched and their loads read invalid locations:
i = 98 - Predict Taken (OK):
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
i = 99 - Predict Taken (Oops): the branch is actually not taken:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
i = 100 - fetched down the wrong path, read invalid location:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
i = 101 - fetched:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
Carnegie Mellon 49
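As a rough model of the loop above (my addition, not from the slides), the sketch below counts how often an "always predict taken" strategy is right for a loop branch that is taken 99 times and falls through once; the single misprediction at the end is exactly the "Oops" case shown on the slide.

    #include <stdio.h>

    int main(void) {
        const int length = 100;              /* vector length from the slide */
        int predictions = 0, mispredictions = 0;

        /* The loop branch is taken whenever another iteration remains. */
        for (int i = 0; i < length; i++) {
            int taken = (i + 1 < length);    /* actual branch outcome this iteration */
            int predict_taken = 1;           /* static strategy: always predict taken */
            predictions++;
            if (predict_taken != taken)
                mispredictions++;
        }
        printf("%d branch executions, %d mispredicted\n", predictions, mispredictions);
        /* prints: 100 branch executions, 1 mispredicted */
        return 0;
    }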

50 Carnegie Mellon Branch Misprediction Invalidation Assume vector length = 100; once the misprediction is detected, the wrongly fetched instructions are invalidated:
i = 98 - Predict Taken (OK):
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
i = 99 - Predict Taken (Oops):
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
i = 100 - Invalidate:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
i = 101 - Invalidate:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
50

51 Carnegie Mellon Branch Misprediction Recovery
i = 99:
    80488b1: movl (%ecx,%edx,4),%eax
    80488b4: addl %eax,(%edi)
    80488b6: incl %edx
    80488b7: cmpl %esi,%edx
    80488b9: jl   80488b1
Definitely not taken: resume on the fall-through path:
    80488bb: leal 0xffffffe8(%ebp),%esp
    80488be: popl %ebx
    80488bf: popl %esi
    80488c0: popl %edi
Performance Cost: multiple clock cycles on a modern processor; can be a major performance limiter 51

52 The Single Core Era The Memory Hierarchy Why have a hierarchy of memory? How does it work? CS 533 Karavanic 52

53 Carnegie Mellon Storage Trends, 1980-2010 (with 2010:1980 ratio): $/MB and access time (ns) for SRAM; $/MB, access time (ns), and typical size (MB) for DRAM; $/MB, access time (ms), and typical size (MB) for disk 53

54 Carnegie Mellon The CPU-Memory Gap The gap between DRAM, disk, and CPU speeds: a log-scale plot (ns, by year) of disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time shows the gap widening over time 54

55 Carnegie Mellon Locality to the Rescue! The key to bridging this CPU-Memory gap is a fundamental property of computer programs known as locality 55

56 Carnegie Mellon Locality Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently Temporal locality: Recently referenced items are likely to be referenced again in the near future Spatial locality: Items with nearby addresses tend to be referenced close together in time 56

57 Carnegie Mellon Locality Example
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
Data references: reference array elements in succession (stride-1 reference pattern): spatial locality; reference variable sum each iteration: temporal locality Instruction references: reference instructions in sequence: spatial locality; cycle through the loop repeatedly: temporal locality 57
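A minimal sketch (my addition, not from the slides) of how spatial locality shows up in practice: both loops below compute the same sum over a matrix stored in row-major order, but the first touches memory with stride 1 while the second jumps a whole row between accesses, so on most machines the first runs noticeably faster.

    #include <stdio.h>

    #define R 1024
    #define C 1024
    static double a[R][C];          /* C stores this row-major: a[i][j] and a[i][j+1] are adjacent */

    /* Good spatial locality: inner loop walks consecutive addresses (stride 1). */
    double sum_row_major(void) {
        double sum = 0.0;
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++)
                sum += a[i][j];
        return sum;
    }

    /* Poor spatial locality: inner loop strides by a full row (C doubles) per access. */
    double sum_col_major(void) {
        double sum = 0.0;
        for (int j = 0; j < C; j++)
            for (int i = 0; i < R; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());   /* same value, different locality */
        return 0;
    }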

58 Carnegie Mellon Memory Hierarchies Some fundamental and enduring properties of hardware and software: Fast storage technologies cost more per byte, have less capacity, and require more power (heat!). The gap between CPU and main memory speed is widening. Well-written programs tend to exhibit good locality. These fundamental properties complement each other beautifully. They suggest an approach for organizing memory and storage systems known as a memory hierarchy. 58

59 Carnegie Mellon An Example Memory Hierarchy (smaller, faster, costlier per byte toward the top; larger, slower, cheaper per byte toward the bottom):
    L0: Registers - CPU registers hold words retrieved from the L1 cache
    L1: L1 cache (SRAM) - holds cache lines retrieved from the L2 cache
    L2: L2 cache (SRAM) - holds cache lines retrieved from main memory
    L3: Main memory (DRAM) - holds disk blocks retrieved from local disks
    L4: Local secondary storage (local disks) - holds files retrieved from disks on remote network servers
    L5: Remote secondary storage (tapes, distributed file systems, Web servers)
59

60 Carnegie Mellon Caches Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device. Fundamental idea of a memory hierarchy: For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. Why do memory hierarchies work? Because of locality, programs tend to access the data at level k more often than they access the data at level k+1. Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit. Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top. 60

61 Carnegie Mellon General Cache Concepts Cache: smaller, faster, more expensive memory; caches a subset of the blocks Data is copied in block-sized transfer units Memory: larger, slower, cheaper memory viewed as partitioned into blocks 61

62 Carnegie Mellon General Cache Concepts: Hit Request: data in block b is needed Block b is in the cache: Hit!

63 Carnegie Mellon General Cache Concepts: Miss Request: data in block b is needed Block b is not in the cache: Miss! Block b is fetched from memory Block b is stored in the cache Placement policy: determines where b goes Replacement policy: determines which block gets evicted (the victim) 63

64 Carnegie Mellon General Caching Concepts: Types of Cache Misses Cold (compulsory) miss Cold misses occur because the cache is empty. Conflict miss Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. E.g. Block i at level k+1 must be placed in block (i mod 4) at level k. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g. Referencing blocks 0, 8, 0, 8, 0, 8,... would miss every time. Capacity miss Occurs when the set of active cache blocks (working set) is larger than the cache. 64
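A small illustrative simulation (my addition, not from the slides) of the conflict-miss example above: a 4-slot direct-mapped cache where block i maps to slot i mod 4, driven by the reference string 0, 8, 0, 8, ... Every reference misses even though the cache is far from full.

    #include <stdio.h>

    #define SLOTS 4                      /* level-k cache holds 4 blocks, direct mapped */

    int main(void) {
        int slot[SLOTS];
        for (int s = 0; s < SLOTS; s++) slot[s] = -1;   /* -1 means the slot is empty */

        int refs[] = {0, 8, 0, 8, 0, 8}; /* blocks 0 and 8 both map to slot 0 */
        int nrefs = (int)(sizeof refs / sizeof refs[0]);
        int misses = 0;

        for (int r = 0; r < nrefs; r++) {
            int block = refs[r];
            int s = block % SLOTS;       /* placement policy: block i goes to slot i mod 4 */
            if (slot[s] != block) {      /* conflict miss: evict whatever is there */
                misses++;
                slot[s] = block;
            }
        }
        printf("%d references, %d misses\n", nrefs, misses);   /* prints: 6 references, 6 misses */
        return 0;
    }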

65 Carnegie Mellon Examples of Caching in the Hierarchy
    Cache Type           | What is Cached?      | Where is it Cached?  | Latency (cycles) | Managed By
    Registers            | 4-8 byte words       | CPU core             | 0                | Compiler
    TLB                  | Address translations | On-Chip TLB          | 0                | Hardware
    L1 cache             | 64-byte blocks       | On-Chip L1           | 1                | Hardware
    L2 cache             | 64-byte blocks       | On/Off-Chip L2       | 10               | Hardware
    Virtual Memory       | 4-KB pages           | Main memory          | 100              | Hardware + OS
    Buffer cache         | Parts of files       | Main memory          | 100              | OS
    Disk cache           | Disk sectors         | Disk controller      | 100,000          | Disk firmware
    Network buffer cache | Parts of files       | Local disk           | 10,000,000       | AFS/NFS client
    Browser cache        | Web pages            | Local disk           | 10,000,000       | Web browser
    Web cache            | Web pages            | Remote server disks  | 1,000,000,000    | Web proxy server
65

66 Carnegie Mellon The Memory Hierarchy: Summary The speed gap between CPU, memory and mass storage continues to widen. Well-written programs exhibit a property called locality. Memory hierarchies based on caching close the gap by exploiting locality. 66

67 Carnegie Mellon Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory. Typical single core system structure: the CPU chip (register file, ALU, cache memories, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory 67

68 The Single Core Era: Other Advances Mobile devices Innovation: fast, pervasive wireless and cellular Innovation: power management Embedded systems Is your car on the internet? Your toaster? Open source Accessibility of the OS, rate of upgrades to the OS Virtualization Innovation: new HW support for virtualization Cloud computing Multicore and manycore processors: disruptive technologies CS 333 Winter 2015

69 Technology Trends: Microprocessor Capacity Moore's Law: 2X transistors/chip every 1.5 years Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Slide source: Jack Dongarra CS 533 Karavanic 69

70 Microprocessor Transistors and Clock Rate Growth in transistors per chip and increase in clock rate (MHz) over time, shown on log scales for processors from the i8080 through the i80286, i80386, R2000, R3000, R10000, and Pentium Why bother with parallel programming? Just wait a year or two CS 533 Karavanic 70

71 Limit #1: Power density Can soon put more transistors on a chip than can afford to turn on. -- Patterson '07 Power density (W/cm2) has been rising toward that of a hot plate, then a nuclear reactor, a rocket nozzle, and eventually the Sun's surface (plot of P6- and Pentium-era chips by year) Scaling clock speed (business as usual) will not work Source: Patrick Gelsinger, Intel CS 533 Karavanic 71

72 Parallelism Saves Power Exploit explicit parallelism to reduce power Power = C * V^2 * F (C = capacitance, V = voltage, F = frequency) Performance = Cores * F Using additional cores: Increasing density (= more transistors = more capacitance) can increase cores (2x) and performance (2x) Or increase cores (2x) but decrease frequency (1/2), which also lets voltage drop (1/2): Power = (2C) * (V/2)^2 * (F/2) = (C * V^2 * F) / 4 and Performance = (2 Cores) * (F/2) = Cores * F, i.e., the same performance at 1/4 the power Additional benefits: small/simple cores -> more predictable performance CS 533 Karavanic 72
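A quick numerical check of the claim above (my addition; the constants are normalized to 1, so only the ratios matter):

    #include <stdio.h>

    int main(void) {
        double C = 1.0, V = 1.0, F = 1.0, cores = 1.0;     /* baseline, normalized */

        double power0 = C * V * V * F;                     /* baseline power */
        double perf0  = cores * F;                         /* baseline performance */

        /* 2x cores, half frequency, half voltage */
        double power1 = (2 * C) * (V / 2) * (V / 2) * (F / 2);
        double perf1  = (2 * cores) * (F / 2);

        printf("power ratio = %.2f, performance ratio = %.2f\n",
               power1 / power0, perf1 / perf0);            /* prints 0.25 and 1.00 */
        return 0;
    }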

73 Limit #2: Hidden Parallelism Tapped Out Application performance was increasing by 52% per year, as measured by the SpecInt benchmarks From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006 VAX: 25%/year, 1978 to 1986 RISC + x86: 52%/year, 1986 to 2002 Roughly half of the 52%/year was due to transistor density, half due to architecture changes, e.g., Instruction Level Parallelism (ILP) CS 533 Karavanic 73

74 Limit #2: Hidden Parallelism Tapped Out Superscalar (SS) designs were the state of the art; many forms of parallelism not visible to programmer multiple instruction issue dynamic scheduling: hardware discovers parallelism between instructions speculative execution: look past predicted branches non-blocking caches: multiple outstanding memory ops Unfortunately, these sources have been used up CS 533 Karavanic 74

75 Uniprocessor Performance (SPECint) Performance (vs. VAX-11/780), from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006: VAX: 25%/year, 1978 to 1986 RISC + x86: 52%/year, 1986 to 2002 RISC + x86: ??%/year (2x every 5 years?), 2002 to present => Sea change in chip design: multiple cores or processors per chip CS 533 Karavanic 75

76 Limit #3: Chip Yield Manufacturing costs and yield problems limit use of density Moore's (Rock's) 2nd law: fabrication costs go up Yield (% usable chips) drops Parallelism can help More smaller, simpler processors are easier to design and validate Can use partially working chips: e.g., the Cell processor (PS3) is sold with 7 out of 8 cores enabled, to improve yield

77 Limit #4: Speed of Light (Fundamental) Consider a 1 Tflop/s, 1 Tbyte sequential machine: Data must travel some distance, r, to get from memory to the CPU. To get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm. Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: each bit occupies about 1 square Angstrom, or the size of a small atom. No choice but parallelism CS 533 Karavanic 77
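A quick check of the per-bit area claim (my arithmetic, not from the slide): 1 Tbyte = 8 x 10^12 bits; the available area is (0.3 mm)^2 = (3 x 10^-4 m)^2 = 9 x 10^-8 m^2; dividing gives about 1.1 x 10^-20 m^2 per bit, and since 1 square Angstrom = (10^-10 m)^2 = 10^-20 m^2, each bit does indeed get roughly 1 square Angstrom.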

78 Chip density is continuing to increase ~2x every 2 years* Clock speed is not Number of processor cores may double instead There is little or no hidden parallelism (ILP) left to be found Parallelism must be exposed to and managed by software Thus, the Multicore Era *Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond) CS 533 Karavanic 78
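To illustrate what "parallelism must be exposed to and managed by software" means in practice, here is a minimal POSIX-threads sketch (my example, not from the slides) that splits an array sum across two threads; on a multicore chip the two halves can run on different cores. The programmer, not the hardware, decides how the work is divided.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N];

    struct range { int lo, hi; double partial; };

    /* Each thread sums its own half of the array independently. */
    static void *sum_range(void *arg) {
        struct range *r = arg;
        r->partial = 0.0;
        for (int i = r->lo; i < r->hi; i++)
            r->partial += a[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = 1.0;

        struct range halves[2] = { {0, N / 2, 0.0}, {N / 2, N, 0.0} };
        pthread_t tid[2];

        for (int t = 0; t < 2; t++)
            pthread_create(&tid[t], NULL, sum_range, &halves[t]);
        for (int t = 0; t < 2; t++)
            pthread_join(tid[t], NULL);

        printf("sum = %.0f\n", halves[0].partial + halves[1].partial);  /* prints 1000000 */
        return 0;
    }

Compile with something like cc -pthread sum.c; the same structure scales to more threads by splitting the range further.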

79 Carnegie Mellon Intel Core i7 Cache Hierarchy Processor package with four cores (Core 0 ... Core 3), each with its own registers, L1 caches, and L2 cache, sharing one L3 cache: L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles L2 unified cache: 256 KB, 8-way, Access: 11 cycles L3 unified cache (shared by all cores): 8 MB, 16-way, Access: 30-40 cycles Block size: 64 bytes for all caches. Main memory 79

80 What is Manycore? What if we use all of the transistors on a chip for as many cores as we can fit? Beyond the edge of the number of cores in common multicore architectures The dividing line is not clearly defined Active research, now in embedded systems & clusters Examples: NVIDIA Fermi Graphics Processing Unit (GPU) First model: 32 CUDA cores per SM, 16 SMs (SM = streaming multiprocessor) K20 model: 2496 CUDA cores, peak 3.52 Tflops CS 533 Karavanic 80

81 Ex: NVIDIA Fermi CS 533 Karavanic 81

82 Examples (cont'd) What is Manycore? Intel Xeon Phi coprocessor and Knights Landing: up to 61 cores Example: the Tianhe-2 Supercomputer has 32,000 multicore central processing units (CPUs) and 48,000 coprocessors ("accelerators"); peak 54.9 Petaflops Tilera Tile-Gx: 100 cores, 64-bit CS 533 Karavanic 82

83 Current Research: Internet of Things, Quantum Computing, Small Devices, Exascale Computing CS 533 Karavanic 83


More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

CS654 Advanced Computer Architecture. Lec 2 - Introduction

CS654 Advanced Computer Architecture. Lec 2 - Introduction CS654 Advanced Computer Architecture Lec 2 - Introduction Peter Kemper Adapted from the slides of EECS 252 by Prof. David Patterson Electrical Engineering and Computer Sciences University of California,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Cache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6.

Cache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6. Cache Memories From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6. Today Cache memory organization and operation Performance impact of caches The memory mountain Rearranging

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 Pipelining, Instruction Level Parallelism and Memory in Processors Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 NOTE: The material for this lecture was taken from several

More information