System and Algorithmic Adaptation for Flash: The FAWN Perspective
David G. Andersen, Vijay Vasudevan, Michael Kaminsky*, Amar Phanishayee, Jason Franklin, Iulian Moraru, Lawrence Tan
Carnegie Mellon University and *Intel Labs
Context: Datacenter Energy
[Image: hydroelectric dam]
Approaches to saving power
  Infrastructure efficiency: power generation, power distribution, cooling
  Dynamic power scaling: sleeping when idle, rate adaptation, VM consolidation
  Computational efficiency: FAWN
Goal of computational efficiency: reduce the amount of energy needed to do useful work
FAWN: Fast Array of Wimpy Nodes
Improve the computational efficiency of data-intensive computing using an array of well-balanced, low-power systems.
[Diagram: FAWN architecture -- a front-end server in front of an array of wimpy back-end nodes (CPU + DRAM + flash)]
Node hardware: 500MHz AMD Geode, 256MB DRAM, 4GB CompactFlash; 1.6GHz single/dual-core Intel Pineview Atom, 2GB DRAM, Intel X25-M/E SSD
Towards balanced systems
[Plot: latency in nanoseconds (log scale) of a CPU cycle, a DRAM access, and a disk seek, 1980-2005; the widening gap represents wasted resources]
Rebalancing options:
  Today's CPUs + array of fastest disks
  Slower CPUs + fast storage
  Slow CPUs + today's disks
Targeting the sweet spot in efficiency
[Plot: instructions/sec/W (millions) vs. instructions/sec (millions, log scale) for a custom ARM mote, XScale 800MHz, Atom Z500, and Xeon 7350; includes a 0.1W fixed power overhead]
  The fastest processors exhibit superlinear power usage.
  Fixed power costs can dominate efficiency for slow processors.
  FAWN targets the sweet spot in system efficiency when including fixed costs.
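The shape of this curve follows from a one-line relation: efficiency = (instructions/sec) / (CPU power + fixed overhead). The C sketch below evaluates that relation for made-up operating points (the class names and numbers are illustrative, not the measured data behind the plot) to show why a fixed 0.1W overhead sinks very slow processors while superlinear power draw sinks the fastest ones.

```c
/* Efficiency = instructions/sec / (CPU power + fixed overhead).
 * Operating points below are hypothetical, chosen only to illustrate the
 * sweet-spot shape; they are not the measurements from the talk. */
#include <stdio.h>

int main(void) {
    struct { const char *name; double mips; double watts; } cpu[] = {
        {"mote-class",    5,     0.003},
        {"XScale-class",  1000,  0.5},
        {"Atom-class",    5000,  2.5},
        {"Xeon-class",    50000, 80.0},
    };
    const double fixed_w = 0.1;   /* fixed platform overhead, as on the slide */

    for (int i = 0; i < 4; i++) {
        double eff = cpu[i].mips / (cpu[i].watts + fixed_w);
        printf("%-13s %8.0f MIPS  %7.3f W  -> %6.0f MIPS/W\n",
               cpu[i].name, cpu[i].mips, cpu[i].watts, eff);
    }
    return 0;
}
```

With these numbers the mote is dragged down by the fixed 0.1W, the Xeon-class point by its own power draw, and the middle of the range comes out most efficient.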
Targeting the sweet spot in efficiency
[Plot: the same efficiency curve, annotated with the rebalancing options (today's CPU + array of fastest disks, slower CPU + fast storage, slow CPU + today's disk); FAWN marks the most efficient region]
Case 1: A high-performance, persistent key-value store
  ~20-byte keys, 10-1000-byte values
  Very small writes
  Irregularly sized objects
  Very random access
The FTL is not our friend.
Using Berkeley DB on CF
Platform: 500MHz AMD Geode, 256MB DRAM, 4GB CompactFlash card
Workload: insert 7M 200-byte entries into the DB
  BDB:      0.07 MB/s
  FAWN-KV:  20 MB/s
Using Flash for K-V
  Write sequentially within an erase block
  Can do this concurrently to several erase blocks, iff the FTL lets you
  (Duplication with the filesystem...)
  Use system memory efficiently
  Otherwise, why use Flash at all? :)
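A minimal sketch of the append-only write pattern described above, assuming a simple (key length, data length, key, data) record framing; this is an illustration, not FAWN-DS's actual on-flash format.

```c
/* Minimal append-only log writer: every put becomes a sequential write at
 * the log tail, the access pattern flash (and the FTL) handles best.
 * Record framing here is illustrative, not the actual FAWN-DS layout. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

struct log_hdr {            /* fixed header preceding each record */
    uint16_t key_len;
    uint32_t data_len;
} __attribute__((packed));

/* Append one key/value record; returns its starting offset, or -1 on error. */
static off_t log_append(int fd, const void *key, uint16_t key_len,
                        const void *data, uint32_t data_len)
{
    off_t off = lseek(fd, 0, SEEK_END);       /* always write at the tail */
    struct log_hdr h = { key_len, data_len };
    if (write(fd, &h, sizeof h) != (ssize_t)sizeof h)    return -1;
    if (write(fd, key, key_len) != (ssize_t)key_len)     return -1;
    if (write(fd, data, data_len) != (ssize_t)data_len)  return -1;
    return off;                               /* index this offset in DRAM */
}

int main(void) {
    int fd = open("store.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) return 1;
    log_append(fd, "key0", 4, "value0", 6);
    close(fd);
    return 0;
}
```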
From key to value
[Diagram: 160-bit key -> hash index into a DRAM hashtable of (KeyFrag, Valid, Offset) entries, 12 bytes per entry -> offset into the Flash data region, where each log entry stores (Key, Len, Data)]
KeyFrag != Key: potential collisions!
Low probability of multiple flash reads.
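A sketch of the lookup path the diagram shows, with an in-memory array standing in for flash. The field widths, toy hash function, fixed value size, and single-slot buckets (no collision chaining) are simplifications for illustration, not the exact 12-byte FAWN-DS entry layout.

```c
/* DRAM key-fragment index over an append-only log: one DRAM probe, then
 * (with high probability) one flash read that verifies the full key. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEY_BYTES 20                 /* 160-bit keys */
#define VAL_BYTES 8                  /* fixed-size values, for brevity */
#define NBUCKETS  (1u << 20)

struct index_entry {                 /* illustrative layout, not 12 bytes */
    uint16_t keyfrag;                /* 15-bit fragment of the key's hash */
    uint8_t  valid;
    uint64_t offset;                 /* offset of the log entry on "flash" */
};

static struct index_entry table[NBUCKETS];
static uint8_t flash[1 << 24];       /* stand-in for the on-flash data region */
static uint64_t log_tail;

/* Toy FNV-style hash of a 160-bit key (placeholder for a real hash). */
static uint64_t hash_key(const uint8_t key[KEY_BYTES]) {
    uint64_t h = 1469598103934665603ull;
    for (int i = 0; i < KEY_BYTES; i++) { h ^= key[i]; h *= 1099511628211ull; }
    return h;
}

/* Append (key, value) to the log and point the DRAM entry at it. */
static void put(const uint8_t key[KEY_BYTES], const uint8_t val[VAL_BYTES]) {
    uint64_t h = hash_key(key);
    memcpy(&flash[log_tail], key, KEY_BYTES);
    memcpy(&flash[log_tail + KEY_BYTES], val, VAL_BYTES);
    struct index_entry *e = &table[h % NBUCKETS];
    e->keyfrag = (uint16_t)(h & 0x7fff);
    e->valid   = 1;
    e->offset  = log_tail;
    log_tail  += KEY_BYTES + VAL_BYTES;
}

/* Lookup: keyfrag can collide, so the full key in the log entry decides. */
static const uint8_t *get(const uint8_t key[KEY_BYTES]) {
    uint64_t h = hash_key(key);
    const struct index_entry *e = &table[h % NBUCKETS];
    if (!e->valid || e->keyfrag != (uint16_t)(h & 0x7fff))
        return NULL;                              /* definitely absent */
    if (memcmp(&flash[e->offset], key, KEY_BYTES) != 0)
        return NULL;                              /* keyfrag collision */
    return &flash[e->offset + KEY_BYTES];         /* the stored value */
}

int main(void) {
    uint8_t key[KEY_BYTES] = "example-160-bit-key";
    put(key, (const uint8_t *)"8bytes!!");
    const uint8_t *v = get(key);
    printf("found: %.8s\n", v ? (const char *)v : "(none)");
    return 0;
}
```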
Just one log is painful
With flash, we're not restricted to one -- maybe.
[Plot: write speed (MB/s) vs. number of FAWN-DS files (log scale) for the Sandisk Extreme IV, Memoright GT, Mtron Mobi, Intel X25-M, and Intel X25-E]
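One way to use several logs while keeping every individual write sequential is to route each record by key hash, roughly as sketched below. The file names and routing policy are assumptions for illustration, not the FAWN-DS implementation the benchmark above measured.

```c
/* Spread writes over several append-only files: each key hash picks one
 * file, so every individual file still sees purely sequential appends. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NFILES 8
static int log_fd[NFILES];

static void open_logs(void) {
    char name[32];
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "fawnds-%d.log", i);  /* names illustrative */
        log_fd[i] = open(name, O_CREAT | O_WRONLY | O_APPEND, 0644);
    }
}

static ssize_t append(uint64_t key_hash, const void *rec, size_t len) {
    return write(log_fd[key_hash % NFILES], rec, len);
}

int main(void) {
    open_logs();
    append(42, "record\n", 7);
    return 0;
}
```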
FAWN-DS Lookups
  System                  QPS    Watts   QPS/Watt
  Alix3c2 / Sandisk (CF)  1298   3.75    346
  Desktop / Mobi (SSD)    4289   83      51.7
  MacBook Pro / HD        66     29      2.3
  Desktop / HD            171    87      1.96
Our FAWN-based system is over 6x more efficient than 2008-era traditional systems.
Ongoing work
  DRAM limits the amount of Flash that can be used: FAWN-KV needs 12 bytes per entry.
  Our ongoing work gets this down to ~1 byte of DRAM per key-value entry (but must re-write the data once), or 3 bits if we can read flash on a table miss.
  BufferHash (NSDI 2010) provides similar benefits, though it wastes 50% of the flash space.
And then we moved to Atom + SSD
  1.6GHz single-core Pineview Atom, 2GB DRAM, Intel X25-M SSD
  2.8GHz 4-core i7, 2GB DRAM, 6x Intel X25-M SSD
  Dual 2.8GHz 4-core Xeon, 8GB DRAM, FusionIO
512 B random reads
  Platform                    Reads/second
  2x 4-core Xeon + FusionIO   ~150 K
  i7 + single X25-M           ~60 K
  i7 + 4x X25-M               ~115 K
  Atom (1 core) + X25-M       ~23 K
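For reference, a generic single-threaded 512-byte random-read microbenchmark might look like the sketch below. This is not the harness behind the numbers above, and reaching 100K+ IOPS additionally requires many outstanding requests (threads or asynchronous I/O); the sketch only shows the access pattern being measured.

```c
/* 512-byte random reads with O_DIRECT at aligned offsets, counted over a
 * fixed wall-clock interval. Run against a device or large file, e.g.
 * "./randread /dev/sdb" (device name is just an example). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <device>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    off_t dev_bytes = lseek(fd, 0, SEEK_END);
    if (dev_bytes < 512) return 1;
    void *buf;
    if (posix_memalign(&buf, 512, 512)) return 1;   /* O_DIRECT needs alignment */

    long reads = 0;
    time_t start = time(NULL);
    while (time(NULL) - start < 10) {               /* 10-second run */
        off_t off = ((off_t)rand() % (dev_bytes / 512)) * 512;
        if (pread(fd, buf, 512, off) != 512) { perror("pread"); break; }
        reads++;
    }
    printf("%ld reads in 10s = %ld reads/sec\n", reads, reads / 10);
    return 0;
}
```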
SATA... Need I say more?
Couldn't get more than ~120K IOPS over the onboard SATA bus, no matter what we tried.
Slow wimpies
Prior results: wimpies dominated in efficiency. What's happening here? 23K vs. 60K reads/sec.
Culprits:
  add_disk_randomness(rq->rq_disk);
  23,000 interrupts/second
  The tester program called gettimeofday
Fixed these, plus new interrupt coalescing: 37K and rising.
Sorting
Similar results using NSort.
But a flash-aware sort can clobber NSort (talk offline).
[Plot: "Sort Efficiency Comparison" -- MB sorted per Joule for Atom + X25-E, i7 desktop + 4x X25-E, i7 server + FusionIO, and 2x Xeon + FusionIO]
Data structures
One idea you've seen: mutable bits through re-programming (Rivest punch-cards '82; Grupp, Yaakobi, Mitz, more).
Can do even better for particular data types...
  Flash should be an ideal add-only Bloom filter
  (Set membership with one-sided error: it will tell you if X is in the set, but may lie and say it is)
  Caching works poorly for Blooms (random access)
  Very important for data mining, etc.
But all of these need bit-level access to Flash...
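A minimal in-memory add-only Bloom filter sketch, to make the "set bits only, never clear them" property concrete. Mapping it onto flash would require the bit-level programming access the last bullet says is missing, and the hash functions here are toy placeholders.

```c
/* Add-only Bloom filter: add() only ever sets bits and never clears them,
 * which is why it maps naturally onto flash re-programming (bits can be
 * programmed in one direction without an erase). In-memory sketch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FILTER_BITS (1u << 20)
#define NUM_HASHES  4

static uint8_t filter[FILTER_BITS / 8];

/* k cheap hash values derived from one key (toy placeholder hash). */
static uint32_t bloom_hash(const void *key, size_t len, int i) {
    uint32_t h = 2166136261u + 0x9e3779b9u * (uint32_t)i;
    const uint8_t *p = key;
    for (size_t j = 0; j < len; j++) { h ^= p[j]; h *= 16777619u; }
    return h % FILTER_BITS;
}

static void bloom_add(const void *key, size_t len) {
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = bloom_hash(key, len, i);
        filter[b / 8] |= 1u << (b % 8);          /* set-only: never cleared */
    }
}

/* Returns 1 for "possibly in set" (one-sided error), 0 for "definitely not". */
static int bloom_query(const void *key, size_t len) {
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = bloom_hash(key, len, i);
        if (!(filter[b / 8] & (1u << (b % 8)))) return 0;
    }
    return 1;
}

int main(void) {
    bloom_add("key0", 4);
    printf("key0: %d  key1: %d\n", bloom_query("key0", 4), bloom_query("key1", 4));
    return 0;
}
```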
Where we're going (?)
PCM (??): better bandwidth, latency, and power, but far less capacity.
Requires even more memory-efficient systems.
The FAWN Perspective
  Pretending Flash is disk or DRAM misses opportunities.
  Making Flash look like disk or DRAM hides opportunities.
  Today's kernels handle high block IOPS poorly (... and we need to fix this).
  Algorithms exploiting re-programmability and semi-random writes can win big.
  But we want to leave the system usable and its abstractions manageable.