BlueDBM: An Appliance for Big Data Analytics [ISCA 2015]
Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT)
Jamey Hicks, John Ankcorn, Myron King (Quanta)
BigData@CSAIL Annual Meeting, November 6, 2015

Big data analytics
Analysis of previously unimaginable amounts of data can provide deep insight:
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control and Prevention (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models
Big data analytics is likely to be the biggest economic driver for the IT industry for the next decade.
A currently popular solution: RAM Cloud
A cluster of machines with large DRAM capacity and a fast interconnect:
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM

What if enough DRAM isn't affordable? Flash-based solutions may be a better alternative:
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- Legacy storage access interface adds overhead
- Slower than DRAM

Latency profile of distributed flash-based analytics
Distributed processing involves many system components:
- Flash device access: 75 μs
- Storage software (OS, FTL, ...): 100~1000 μs
- Network interface (10GbE, Infiniband, ...): 20~1000 μs
- Actual processing: 50~100 μs
Latency is additive, so the software and network overheads dominate the raw device latency.
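Since the latencies stack, a few lines of arithmetic show how far a conventional distributed-flash stack drifts from the raw device. Below is a minimal sketch using the figures from the slide above; the best-case/worst-case split across the ranges is an assumption for illustration.

```python
# Minimal latency model: end-to-end latency is the sum of per-component
# latencies, so software and network overheads can dwarf the flash device
# itself. Figures (in microseconds) are taken from the slide above.

FLASH_ACCESS_US = 75
STORAGE_SOFTWARE_US = (100, 1000)   # OS, FTL, ... (best case, worst case)
NETWORK_US = (20, 1000)             # 10GbE, Infiniband, ... (best, worst)
PROCESSING_US = (50, 100)

def total_latency_us(best_case: bool) -> int:
    i = 0 if best_case else 1
    return FLASH_ACCESS_US + STORAGE_SOFTWARE_US[i] + NETWORK_US[i] + PROCESSING_US[i]

if __name__ == "__main__":
    print(f"Best case:  {total_latency_us(True)} us")   # 245 us, ~3x raw flash
    print(f"Worst case: {total_latency_us(False)} us")  # 2175 us, ~29x raw flash
```

Even in the best case, the full stack more than triples the 75 μs device latency, which is what motivates the architectural changes on the next slide.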
Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead:
- Near-storage processing
- Cross-layer optimization of flash management software
- Dedicated storage area network
[Figure: with an in-storage accelerator, flash access stays at 75 μs, processing takes 50~100 μs, and the remaining software/network overhead drops below 20 μs]
This is difficult to explore using flash packaged as off-the-shelf SSDs.

A custom flash card had to be built
[Photo: custom flash card — an Artix-7 FPGA, four flash buses (Bus 0~3), a flash chip array on both sides, and an FMC connector to the VC707's HPC FMC port]
BlueDBM: Platform with near-storage processing and inter-controller networks
[Photos: 1 of 2 racks (10 nodes); a BlueDBM storage device]
- 20 24-core Xeon servers
- 20 BlueDBM storage devices, 1TB flash storage each
- x4 20Gbps controller network
- Xilinx VC707
- 2GB/s PCIe
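For a sense of scale, the per-node figures above multiply out as in the quick sketch below. One point worth noticing: the x4 20Gbps controller links carry several times more bandwidth than the 2GB/s PCIe host link.

```python
# Aggregate figures for the 20-node BlueDBM cluster, from the specs above.
NODES = 20
FLASH_PER_NODE_TB = 1
LINKS_PER_NODE = 4     # x4 controller-network links per node
LINK_GBPS = 20         # 20 Gbps per link
PCIE_GBPS = 2 * 8      # 2 GB/s PCIe host link, expressed in Gbps

print(f"Total flash capacity: {NODES * FLASH_PER_NODE_TB} TB")
net_gbps = LINKS_PER_NODE * LINK_GBPS
print(f"Controller network per node: {net_gbps} Gbps ({net_gbps // 8} GB/s)")
print(f"PCIe to host: {PCIE_GBPS} Gbps ({PCIE_GBPS // 8} GB/s)")
```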
BlueDBM node architecture
[Diagram: flash chips → device controller → in-storage processor → PCIe interface → host server, with a port into the inter-controller network]
- Device controller: lightweight flash management with very low overhead; ECC support
- In-storage processor: adds almost no latency
- Custom network protocol with low latency/high bandwidth: x4 20Gbps links at 0.5 μs latency; virtual channels with flow control
- Software has very low-level access to flash storage: high-level information can be used for low-level management, and the FTL is implemented inside the file system

Power consumption is low

Component             Power (Watts)
VC707                 30
Flash Board (x2)      10
Storage Device Total  40

Component             Power (Watts)
Storage Device        40
Xeon Server           200+
Node Total            240+

The storage device power consumption is a very conservative estimate. Adding a GPU-based accelerator would double the node's power.
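One consequence of the 0.5 μs link latency above: flash on a remote node is almost as cheap to reach as local flash, since hop costs vanish next to the ~75 μs device access. Below is a back-of-envelope model; the hop counts are illustrative, not measurements from the talk.

```python
# Back-of-envelope model using the figures above: with 0.5 us per
# controller-network hop, reaching flash on a remote BlueDBM node adds
# almost nothing on top of the ~75 us flash access itself.

FLASH_ACCESS_US = 75.0
HOP_LATENCY_US = 0.5

def remote_flash_latency_us(hops: int) -> float:
    return FLASH_ACCESS_US + hops * HOP_LATENCY_US

for hops in (0, 1, 4, 8):
    total = remote_flash_latency_us(hops)
    overhead = 100.0 * (total / FLASH_ACCESS_US - 1.0)
    print(f"{hops} hops: {total:5.1f} us (+{overhead:.1f}% over local flash)")
```

Even eight hops away, remote flash costs only about 5% more than local flash, which is why a dedicated controller network makes distributed flash look like one large local store.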
Applications
- High-dimensional nearest neighbor search*: faster flash with accelerators as a replacement for DRAM-based systems
- BlueCache, an accelerated memcached*: a dedicated network and accelerated caching with larger capacity
- Graph analytics: benefits of lower-latency access into distributed flash for computation on large graphs
* Results obtained since the paper submission

Image search accelerator (Sang-Woo Jun, Chanwoo Chung)
[Chart: BlueDBM + FPGA vs. BlueDBM + CPU (the CPU is the bottleneck) vs. an off-the-shelf M.2 SSD]
Faster flash with acceleration can perform at DRAM speed.
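To make the image-search setup concrete, here is an illustrative software model of the kind of kernel an in-storage accelerator runs for high-dimensional nearest-neighbor search: feature vectors stream off flash, and only the best match crosses PCIe to the host. This is a sketch of the idea, not the FPGA implementation from the talk.

```python
# Sketch of an in-storage nearest-neighbor kernel: stream feature vectors
# off flash, keep the running best match, return only the winner to the host.
import math

def nearest_neighbor(query, vector_stream):
    """vector_stream yields (vector_id, vector) pairs streamed from flash."""
    best_id, best_dist = None, math.inf
    for vid, vec in vector_stream:
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))  # squared L2
        if dist < best_dist:
            best_id, best_dist = vid, dist
    return best_id, best_dist
```

Because only the (id, distance) pair returns to the host, the 2GB/s PCIe link never sees the bulk data; the dataset is scanned at flash bandwidth next to the chips.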
BlueCache: Accelerated memcached service (Shuotao Xu)
[Chart: throughput (KOps/s, 0~350) vs. cache miss rate (0~50%); key size = 64 bytes, value size = 8K bytes; 5 ms penalty per cache miss; BlueCache (assuming no cache misses) vs. memcached + local DRAM]
A high cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash).

Graph traversal performance
[Chart: nodes traversed per second (0~18,000) for Software + DRAM, Software + separate flash, Software + flash controller, and Accelerator + flash controller. All DRAM accesses are remote, but use the BlueDBM network as opposed to Ethernet]
A flash-based system can achieve comparable performance with a much smaller cluster.
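The BlueCache crossover in the chart above follows from a simple expected-service-time model, sketched below. The 5 ms miss penalty comes from the slide; the per-hit latencies for DRAM and flash are assumed values for illustration.

```python
# Simple service-time model behind the BlueCache chart: a DRAM cache is
# faster per hit, but every miss costs ~5 ms, so a larger flash cache with a
# high hit rate wins once the DRAM miss rate exceeds a few percent.

MISS_PENALTY_US = 5000.0   # 5 ms per cache miss (from the slide)
DRAM_HIT_US = 10.0         # assumed DRAM (memcached) hit latency
FLASH_HIT_US = 200.0       # assumed flash (BlueCache) hit latency

def mean_latency_us(hit_us: float, miss_rate: float) -> float:
    """Expected latency per request for a cache with the given miss rate."""
    return hit_us + miss_rate * MISS_PENALTY_US

# Flash cache assumed large enough to hold everything (no misses, as in the chart):
flash = mean_latency_us(FLASH_HIT_US, 0.0)
for miss_pct in (1, 4, 10, 25):
    dram = mean_latency_us(DRAM_HIT_US, miss_pct / 100.0)
    winner = "flash" if flash < dram else "DRAM"
    print(f"DRAM miss rate {miss_pct:2d}%: DRAM {dram:7.1f} us vs flash {flash:5.1f} us -> {winner}")
```

Under these assumptions the break-even point is (200 - 10) / 5000 ≈ 3.8% misses; beyond that, the large flash cache's higher hit rate dominates its slower per-access latency.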
Conclusion
- Fast flash-based distributed storage systems with low-latency random access may be a good platform to support complex queries on Big Data
- Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks
- Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration
Thank you