The History of I/O Models Erik Demaine

Size: px

Start display at page:

Download "The History of I/O Models Erik Demaine"

Sabrina Cooper
5 years ago
Views:

1 The History of I/O Models Erik Demaine MASSACHUSETTS INSTITUTE OF TECHNOLOGY

2 Memory Hierarchies in Practice CPU 750 ps Registers 100B Level 1 Cache 100KB Level 2 Cache 1MB 10GB 14 ns Main Memory 1EB-1ZB 4 ms Disk Internet 1TB Outer Space < 10 83

3 Models, Models, Models Model Year Blocking Caching Levels Simple Idealized 2-level Red-blue pebble External memory HMM 1987 BT 1987 ~ (U)MH 1990 Cache oblivious

4 Idealized Two-Level Storage [Floyd Complexity of Computer Computations 1972] PDP-11 ( s) Dennis Ritchie 2.4MB RK03 disks Ken Thompson 8KB core memory

5 Idealized Two-Level Storage [Floyd Complexity of Computer Computations 1972] RAM = blocks of BB items Block operation: Read up to BB items from two blocks ii, jj Write to third block kk Ignore item order within block CPU operations considered free Items are indivisible CPU BB items ii jj kk

6 Permutation Lower Bound [Floyd Complexity of Computer Computations 1972] Theorem: Permuting NN items to NN BB (full) specified blocks needs BB items ii Ω NN BB log BB CPU jj block operations, in average case kk Assuming NN BB > BB (tall disk) Simplified model: Move items instead of copy Equivalence: Follow item s path from start to finish

7 Permutation Lower Bound [Floyd Complexity of Computer Computations 1972] Potential: Φ = nn iiii log nn iiii ii,jj BB items ii # items in block ii destined for block jj Maximized in target configuration of full blocks (nn iiii =BB): Φ = NN log BB Random configuration with NN BB > BB jj kk has E nn iiii = OO(1) E Φ = OO(NN) Claim: Block operation increases Φ by BB Number of block operations NN log BB OO(NN) BB

8 Permutation Lower Bound [Floyd Complexity of Computer Computations 1972] Potential: Φ = nn iiii log nn iiii ii,jj BB items ii # items in block ii destined for block jj Maximized in target configuration of full blocks (nn iiii =BB): Φ = NN log BB Random configuration with NN BB > BB jj kk has E nn iiii = OO(1) E Φ = OO(NN) Claim: Block operation increases Φ by BB o Fact: xx + yy log xx + yy xx log xx + yy log yy + xx + yy o So combining groups xx & yy increases Φ by xx + yy

9 Permutation Bounds [Floyd Complexity of Computer Computations 1972] Theorem: Ω NN BB log BB Tight for BB = OO(1) Theorem: OO NN BB log NN BB Similar to radix sort, where key = target block index CPU BB items ii jj kk Accidental claim: tight for all BB < NN BB We will see: tight for BB > log NN BB

10 Idealized Two-Level Storage [Floyd Complexity of Computer Computations 1972] External memory & word RAM: BB items ii jj kk

11 Idealized Two-Level Storage [Floyd Complexity of Computer Computations 1972] External memory & word RAM: BB items ii CPU jj Foreshadowing future models: kk

12 Red-Blue Pebble Game [Hong & Kung STOC 1981] TRS-80 [ ] CPU cache MM items disk 4KB RAM Apple II [ ] 5KB RAM VIC-20 [ ] 4KB RAM

13 Pebble Game [Hopcroft, Paul, Valiant J. ACM 1977] View computation as DAG of data dependencies Pebble = in memory Moves: Place pebble on node if all predecessors have a pebble Remove pebble from node inputs: outputs: Goal: Pebbles on all output nodes Minimize maximum number of pebbles over time

14 Pebble Game [Hopcroft, Paul, Valiant J. ACM 1977] Theorem: Any DAG can be executed using OO maximum pebbles nn log nn inputs: Corollary: DTIME tt DSPACE tt log tt outputs:

15 Red-Blue Pebble Game [Hong & Kung STOC 1981] Red pebble = in cache inputs: Blue pebble = on disk Moves: Place red pebble on node if all predecessors have red pebble outputs: Remove pebble from node Write: Red pebble blue pebble minimize Read: Blue pebble red pebble Goal: Blue inputs to blue outputs MM red pebbles at any time

16 Red-Blue Pebble Game [Hong & Kung STOC 1981] Red pebble = in cache Blue pebble = on disk inputs: cache disk outputs: CPU MM items minimize number of cache disk I/Os (memory transfers)

Red-Blue Pebble Game Results [Hong & Kung STOC 1981] Computation DAG Fast Fourier Transform (FFT) Ordinary matrixvector multiplication Memory Transfers Θ NN log MM NN Θ

17 Red-Blue Pebble Game Results [Hong & Kung STOC 1981] Computation DAG Fast Fourier Transform (FFT) Ordinary matrixvector multiplication Memory Transfers Θ NN log MM NN Θ NN2 MM Speedup Θ log MM Θ MM Ordinary matrixmatrix multiplication Θ NN3 MM Θ MM Odd-even transposition sort Θ NN2 MM Θ MM NN NN NN grid dd Ω NN dd MM 1/(dd 1) Θ MM 1/(dd 1)

18 Comparison Idealized twolevel storage [Floyd 1972] Red-blue pebble game [Hong & Kung 1981] BB items cache disk CPU CPU MM items blocks but no cache cache but no blocks

19 I/O Model [Aggarwal & Vitter ICALP 1987, C. ACM 1988] AKA: External Memory Model, Disk Access Model Goal: Minimize number of I/Os (memory transfers) cache BB items disk BB items CPU MM BB blocks cache and blocks

20 Scanning [Aggarwal & Vitter ICALP 1987, C. ACM 1988] Visiting NN elements in order costs OO 1 + NN BB memory transfers More generally, can run MM parallel scans, BB keeping 1 block per scan in cache E.g., merge OO MM BB in OO 1 + NN BB lists of total size NN memory transfers BB items

21 Practical Scanning [Arge] Does the BB factor matter? Should I presort my linked list before traversal? Example: NN = 256,000,000 ~ 1GB BB = 8,000 ~ 32KB (small) 1ms disk access time (small) NN memory transfers take 256,000 sec 71 hours NN BB memory transfers take = 3333 seconds

22 Searching [Aggarwal & Vitter ICALP 1987, C. ACM 1988] Finding an element xx among NN items requires Θ log BB+1 NN memory transfers Lower bound: (comparison model) Each block reveals where xx fits among BB items Learn log BB + 1 bits per read Need log NN + 1 bits Upper bound: B-tree Insert & delete in OO log BB+1 NN BB items

23 Sorting and Permutation [Aggarwal & Vitter ICALP 1987, C. ACM 1988] Sorting bound: Θ NN BB log MM BB Permutation bound: Θ min NN, NN BB log MM BB Either sort or use naïve RAM algorithm Solves Floyd s two-level storage problem (MM = 3BB) NN BB BB items NN BB

24 Sorting Lower Bound [Aggarwal & Vitter ICALP 1987, C. ACM 1988] Sorting bound: Ω NN BB log MM BB BB Always keep cache sorted (free) Might as well presort each block Upon reading a block, learn how those BB items fit amongst MM items in cache Learn lg MM+BB BB lg MM BB bits BB Need lg NN! NN lg NN bits Know NN lg BB bits from block presort NN BB items

25 Sorting Upper Bound [Aggarwal & Vitter ICALP 1987, C. ACM 1988] Sorting bound: OO NN BB log MM BB NN BB OO MM BB -way mergesort TT NN = MM BB TT NN MM BB + OO 1 + NN BB TT BB = OO 1 NN log MM BB BB levels MM/BB NN/BB NN/BB NN/BB BB items NN/BB

26 Distribution Sort [Aggarwal & Vitter ICALP 1987, C. ACM 1988] MM BB-way quicksort 1. Find MM BB partition elements, roughly evenly spaced 2. Partition array into MM BB + 1 pieces Scan: OO NN memory transfers BB 3. Recurse Same recurrence as mergesort BB items

27 Distribution Sort Partitioning [Aggarwal & Vitter ICALP 1987, C. ACM 1988] 1. For first, second, interval of MM items: Sort in OO MM BB memory transfers Sample every 1 MM BB th item 4 Total sample: 4NN/ MM BB items 2. For ii = 1,2,, MM BB: Run linear-time selection to find sample element at ii MM BB fraction Cost: OO NN MM BB BB each Total: OO NN BB memory transf. BB items

28 Random vs. Sequential I/Os [Farach, Ferragina, Muthukrishnan FOCS 1998] Sequential memory transfers are part of bulk read/write of Θ MM items Random memory transfer otherwise Sorting: 2-way mergesort achieves OO NN BB log NN BB sequential oo NN BB log MM BB Same trade-off for suffix-tree construction NN BB random implies Ω NN BB log NN BB total BB items

Hierarchical Memory Model (HMM) [Aggarwal, Alpern, Chandra, Snir STOC 1987] Nonuniform-cost RAM: Accessing memory location xx costs ff xx = log xx CPU 1 1 1 1 1

29 Hierarchical Memory Model (HMM) [Aggarwal, Alpern, Chandra, Snir STOC 1987] Nonuniform-cost RAM: Accessing memory location xx costs ff xx = log xx CPU ii particularly simple model of computation that mimics the behavior of a memory hierarchy consisting of increasingly larger amounts of slower memory

30 Why ff xx = llllll xx? [Mead & Conway 1980]

31 HMM Upper & Lower Bounds [Aggarwal, Alpern, Chandra, Snir STOC 1987] Problem Time Slowdown Semiring matrix multiplication Fast Fourier Transform Θ NN 3 Θ(NN log NN log log NN) Θ(1) Θ log log NN Sorting Θ(NN log NN log log NN) Θ log log NN Scanning input (sum, max, DFS, planarity, etc.) Θ NN log NN Θ log NN Binary search Θ log 2 NN Θ log NN

0 ( polynomially bounded ) Write ff xx = ii ww ii [xx xx ii?

32 HHHHMM ff(xx) [Aggarwal, Alpern, Chandra, Snir STOC 1987] Say accessing memory location xx costs ff xx Assume ff 2xx cc ff(xx) for a constant cc > 0 ( polynomially bounded ) Write ff xx = ii ww ii [xx xx ii? ] (weighted sum of threshold functions) xx ii ww 1 ww 2 0 ww 0 CPU xx 0 xx 1 xx 0 xx 2 xx 1

33 Uniform Optimality [Aggarwal, Alpern, Chandra, Snir STOC 1987] Consider one term ff MM xx = [xx MM? ] Algorithm is uniformly optimal if optimal on HMM ffmm xx for all MM Implies optimality for all ff xx cache disk red-blue pebble game! MM CPU 0 CPU MM 1 MM items

34 HHHHMM ffmm xx Upper & Lower Bounds [Aggarwal, Alpern, Chandra, Snir STOC 1987] Problem Time Speedup Semiring matrix multiplication Fast Fourier Transform Θ NN3 MM Θ NN log MM NN Θ MM Θ log MM Sorting Θ NN log MM NN Θ log MM Scanning input (sum, max, DFS, planarity, etc.) Θ NN MM upper bounds known by Hong & Kung 1981 other bounds follow from Aggarwal & Vitter /MM Binary search Θ log NN log MM 1 + 1/ log MM

35 Implicit HMM Memory Management [Aggarwal, Alpern, Chandra, Snir STOC 1987] Instead of algorithm explicitly moving data, use any conservative replacement strategy (e.g., FIFO or LRU) to evict from cache [Sleator & Tarjan C. ACM 1985] TT LRU NN, MM 2 TT OPT NN, MM 2 = OO TT OPT NN, MM assuming ff 2xx cc ff(xx) Not uniform! 0 CPU MM 1

36 Implicit HMM Memory Management [Aggarwal, Alpern, Chandra, Snir STOC 1987] For general ff, split memory into chunks at xx where ff(xx) doubles (up to rounding) ww 1 ww 2 0 ww 0 CPU xx 0 xx 1 xx 0 xx 2 xx 1

37 Implicit HMM Memory Management [Aggarwal, Alpern, Chandra, Snir STOC 1987] For general ff, split memory into chunks at xx where ff(xx) doubles (up to rounding) LRU eviction from first chunk into second; LRU eviction from second chunk into third; etc. TT LLLLLL NN = OO TT OOOOOO NN + NN ff NN Like MTF CPU ii

38 HMM with Block Transfer (BT) [Aggarwal, Chandra, Snir FOCS 1987] Accessing memory location xx costs ff xx Copying memory interval from xx δδ xx to yy δδ yy costs ff max xx, yy + δδ Models memory pipelining ~ block transfer Ignores block alignment, explicit levels, etc. CPU ii

39 BT Results [Aggarwal, Chandra, Snir FOCS 1987] Problem Dot product, merging lists ff xx = log xx ff xx = xx αα, 0 < αα < 1 ff xx = xx ff xx = xx αα, αα > 1 Θ NN log NN Θ NN log log NN Θ(NN log NN) Θ NN αα Matrix mult. Θ NN 3 Θ NN 3 Θ NN 3 Θ NN αα Fast Fourier Transform if αα > 1.5 Θ NN log NN Θ NN log NN Θ NN log 2 NN Θ NN αα Sorting Θ NN log NN Θ NN log NN Θ NN log 2 NN Θ NN αα Binary search Θ log2 NN log log NN Θ NN αα Θ NN Θ NN αα

40 Memory Hierarchy Model (MH) [Alpern, Carter, Feig, Selker FOCS 1990] Multilevel version of external-memory model MM ii MM ii+1 transfers happen in blocks of size BB ii (subblocks of MM ii+1 ), and take tt ii time All levels can be actively transferring at once BB 0 BB 1 BB 2 BB 3 BB 4 CPU 0 tt 0 tt 1 tt 2 tt 3 MM 0 MM 1 MM 2 MM 3 MM 4

41 Uniform Memory Hierarchy (UMH) [Alpern, Carter, Feig, Selker FOCS 1990] Fix aspect ratio αα = MM/BB, block growth ββ = BB ii+1 BB BB ii BB ii = ββ ii MM ii BB ii = αα ββ ii tt ii = ββ ii ff(ii) 2 parameters 1 function CPU BB 0 BB 1 BB 2 BB 3 BB 4 0 tt 0 tt 1 tt 2 tt 3 MM 0 MM 1 MM 2 MM 3 MM 4

42 UMH Results [Alpern, Carter, Feig, Selker FOCS 1990] Problem Upper Bound Lower Bound Matrix transpose ff ii = 1 OO ββ 2 NN2 Ω αααα 4 Matrix mult. ff ii = O ββ ii OO ββ 3 NN3 Ω ββ 3 NN3 NN2 FFT ff ii ii OO 1 Ω 1 General approach: Divide & conquer

43 Cache-Oblivious Model [Frigo, Leiserson, Prokop, Ramachandran FOCS 1999] Analyze RAM algorithm (not knowing BB or MM) on external-memory model Must work well for all BB and MM cache BB items disk BB items CPU MM BB blocks

44 Cache-Oblivious Model [Frigo, Leiserson, Prokop, Ramachandran FOCS 1999] Automatic block transfers via LRU or FIFO Lose factor of 2 in MM and number of transfers Assume TT BB, 2MM cc TT(BB, MM) cache BB items disk BB items CPU MM BB blocks

45 Cache-Oblivious Model [Frigo, Leiserson, Prokop, Ramachandran FOCS 1999] Clean model Adapts to changing BB (e.g., disk tracks) and changing MM (e.g., competing processes) Adapts to multilevel memory hierarchy (MH) Assuming inclusion BB 0 BB 1 BB 2 BB 3 BB 4 CPU 0 tt 0 tt 1 tt 2 tt 3 MM 0 MM 1 MM 2 MM 3 MM 4

46 Scanning [Frigo, Leiserson, Prokop, Ramachandran FOCS 1999] Visiting NN elements in order costs OO 1 + NN memory transfers BB More generally, can run OO 1 parallel scans Assume MM cc BB for appropriate constant cc > 0 E.g., merge two lists in OO NN BB BB items

47 Searching [Prokop Meng 1999] van Emde Boas layout

48 Searching [Prokop Meng 1999] height( ) 1 lg BB 2 2 memory transfers per 4 log BB NN total

49 Cache-Oblivious Searching lg ee + oo 1 log BB NN is optimal [Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz FOCS 2003] Dynamic B-tree in OO log BB NN per operation [Bender, Demaine, Farach-Colton FOCS 2000] [Bender, Duan, Iacono, Wu SODA 2002] [Brodal, Fagerberg, Jacob SODA 2002]

50 Cache-Oblivious Sorting OO NN log BB MM BB MM Ω BB 1+εε NN BB Funnel sort: mergesort analog Distribution sort possible, assuming (tall cache) [Frigo, Leiserson, Prokop, Ramachandran FOCS 1999; Brodal & Fagerberg ICALP 2002] Impossible without tall-cache assumption [Brodal & Fagerberg STOC 2003]

52 Models, Models, Models Model Year Blocking Caching Levels Simple Idealized 2-level Red-blue pebble External memory HMM 1987 BT 1987 ~ (U)MH 1990 Cache oblivious

Lecture 19 Apr 25, 2007

6.851: Advanced Data Structures Spring 2007 Prof. Erik Demaine Lecture 19 Apr 25, 2007 Scribe: Aditya Rathnam 1 Overview Previously we worked in the RA or cell probe models, in which the cost of an algorithm